# Probability Theory

[A Concrete Introduction to Probability (using Python)](https://github.com/norvig/pytudes/blob/master/ipynb/Probability.ipynb)

## Expectation

The average value of some function $f(x)$ under a probability distribution $p(x)$ is called the expectation of $f(x)$ and is denoted by $\mathbb{E}[f]$. It is given by:

$\mathbb{E}[f] = \sum_{x}p(x)f(x)$

For a finite number $N$ of points drawn from the distribution, the expectation can be approximated as

$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f\left(x_{n}\right)$

A subscript indicates which variable is being averaged over, e.g. $\mathbb{E}_{x}[f(x,y)]$ denotes the average of $f(x,y)$ with respect to $x$.

We can also consider a conditional expectation with respect to a conditional distribution, so that

$\mathbb{E}_{x}[f \mid y]=\sum_{x} p(x \mid y) f(x)$

## Variance

The expected squared deviation of $f$ from its mean $\mathbb{E}[f]$ is the variance of $f$, i.e.

$\operatorname{var}[f] = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2]$

Expanding out the square, the variance can also be written as:

$\operatorname{var}[f]=\mathbb{E}\left[f(x)^{2}\right]-\mathbb{E}[f(x)]^{2}$

## Covariance

For two random variables, the covariance is defined by

$\begin{aligned} \operatorname{cov}[x, y] &=\mathbb{E}_{x, y}[\{x-\mathbb{E}[x]\}\{y-\mathbb{E}[y]\}] \\ &=\mathbb{E}_{x, y}[x y]-\mathbb{E}[x] \mathbb{E}[y] \end{aligned}$

For two vectors of random variables, the covariance matrix is defined as:

$\begin{aligned} \operatorname{cov}[\mathbf{x}, \mathbf{y}] &=\mathbb{E}_{\mathbf{x}, \mathbf{y}}\left[\{\mathbf{x}-\mathbb{E}[\mathbf{x}]\}\left\{\mathbf{y}^{\mathrm{T}}-\mathbb{E}\left[\mathbf{y}^{\mathrm{T}}\right]\right\}\right] \\ &=\mathbb{E}_{\mathbf{x}, \mathbf{y}}\left[\mathbf{x} \mathbf{y}^{\mathrm{T}}\right]-\mathbb{E}[\mathbf{x}] \mathbb{E}\left[\mathbf{y}^{\mathrm{T}}\right] \end{aligned}$

## Likelihood Function

Given a discrete random variable $X$ and a parameter of interest $\gamma$, the likelihood of $\gamma$ for an observed value $X=x$ is

$\mathcal{L}_\gamma(X=x)=p(X=x \mid \gamma)$

## Joint probability

The probability that $X$ will take the value $x_i$ and $Y$ will take the value $y_j$ is written $p(X=x_i, Y=y_j)$ and is called the joint probability of $X=x_i$ and $Y=y_j$.

## Marginal probability

If $Y$ takes the values $\{y_1, y_2, \ldots, y_L\}$, then by the sum rule of probability:

$p\left(X=x_{i}\right)=\sum_{j=1}^{L} p\left(X=x_{i}, Y=y_{j}\right)$

$p(X=x_i)$ is called the marginal probability because it is obtained by marginalizing, or summing out, the other variable ($Y$).

## Conditional probability

If we consider only instances for which $X = x_i$, then the fraction of such instances for which $Y=y_j$ is written $p(Y=y_j \mid X=x_i)$ and is called the conditional probability of $Y=y_j$ given $X=x_i$.

For two events $A$ and $B$ with $P(B)>0$, the conditional probability of $A$ given that $B$ has occurred is defined as:

$P(A \mid B)=\frac{P(A \cap B)}{P(B)}$

## Rules of probability

For two random variables $X$ and $Y$:

Sum rule: $p(X) = \sum_{Y} p(X,Y)$

Product rule: $p(X,Y) = p(Y \mid X)\,p(X)$

Here $p(X,Y)$ is the joint probability, $p(Y \mid X)$ is the conditional probability, and $p(X)$ is the marginal probability.

![[rules of probs 1.jpg]]
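The sum and product rules can be checked directly on a small joint probability table. Below is a minimal NumPy sketch (all numbers are hypothetical) showing marginalization via the sum rule, conditioning via the product rule, and a conditional expectation.

```python
# Minimal sketch (hypothetical numbers) of the sum rule, product rule,
# and a conditional expectation for two discrete random variables X and Y.
import numpy as np

# Joint distribution p(X, Y) as a table: rows index X, columns index Y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
assert np.isclose(p_xy.sum(), 1.0)

# Sum rule: marginals are obtained by summing out the other variable.
p_x = p_xy.sum(axis=1)          # p(X)
p_y = p_xy.sum(axis=0)          # p(Y)

# Product rule rearranged: p(Y | X) = p(X, Y) / p(X).
p_y_given_x = p_xy / p_x[:, None]

# Conditional expectation E_y[f | X = x_0] for some function f(y).
y_values = np.array([1.0, 2.0, 3.0])   # values taken by Y (hypothetical)
f = y_values ** 2
cond_expectation = np.sum(p_y_given_x[0] * f)

print(p_x, p_y, cond_expectation)
```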
## Law of total probability

The law of total probability expresses the total probability of an outcome which can be realized via several distinct events $B_n$ (a partition of the sample space):

$P(A)=\sum_n P\left(A \mid B_n\right) P\left(B_n\right)$

## Chain rule

The chain rule is derived by successive application of the product rule:

$\begin{aligned} P(A, B, C, D, E) &=P(A \mid B, C, D, E) P(B, C, D, E) \\ &=P(A \mid B, C, D, E) P(B \mid C, D, E) P(C, D, E) \\ &=\ldots \\ &= P(A \mid B, C, D, E) P(B \mid C, D, E) P(C \mid D, E) P(D \mid E) P(E) \end{aligned}$

## Probability Distributions

$\begin{array}{|l|l|l|l|} \hline \text { Distribution } & p(x \mid \theta) & \text { Range of } x & \text { Range of } \theta \\ \hline \text { Bernoulli } & \theta^{[x=1]}(1-\theta)^{[x=0]} & x \in\{0,1\} & 0 \leq \theta \leq 1 \\ \hline \text { Beta } & \frac{\Gamma\left(\theta_{1}+\theta_{0}\right)}{\Gamma\left(\theta_{1}\right) \Gamma\left(\theta_{0}\right)} x^{\theta_{1}-1}(1-x)^{\theta_{0}-1} & 0 \leq x \leq 1 & \theta_{1}, \theta_{0}>0 \\ \hline \text { Poisson } & \frac{\theta^{x}}{x !} e^{-\theta} & x \in\{0,1,2, \ldots\} & \theta>0 \\ \hline \text { Gamma } & \frac{\theta_{1}^{\theta_{0}}}{\Gamma\left(\theta_{0}\right)} x^{\theta_{0}-1} e^{-\theta_{1} x} & x>0 & \theta_{1}, \theta_{0}>0 \\ \hline \text { Gaussian } & \frac{1}{\theta_{1} \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\theta_{0}}{\theta_{1}}\right)^{2}} & -\infty<x<\infty & -\infty<\theta_{0}<\infty, \theta_{1}>0 \\ \hline \end{array}$

### Discrete distributions

#### Bernoulli

The Bernoulli distribution models a random experiment with two possible outcomes, usually labeled as success (1) or failure (0), each with its own probability.

- $\operatorname{P}(X=1)=p$ and $\operatorname{P}(X=0)=q=1-p$
- $p$ is the probability of success and $q=1-p$ is the probability of failure
- $E[X] = p$, $V[X] = pq$
- Conjugate prior is the Beta distribution

#### Binomial

The Binomial distribution models the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success.

- $P(X=k)=\binom{n}{k} p^k(1-p)^{n-k} ; \quad k=0,1,2, \ldots, n$
- $E[X]=n p$, $V[X]=n p(1-p)$
- Conjugate prior is the Beta distribution

#### Categorical

The Categorical distribution is the generalization of the Bernoulli distribution to a categorical random variable, i.e. one with $k$ possible outcomes (rather than just 0 and 1).

$f(x=i \mid \boldsymbol{p})=p_i,$

where $\boldsymbol{p}=\left(p_1, \ldots, p_k\right)$, $p_i$ represents the probability of seeing element $i$, and $\sum_{i=1}^k p_i=1$.

- Conjugate prior is the Dirichlet distribution

#### Multinomial

The Multinomial distribution is the generalization of the Binomial distribution to categorical random variables over $n$ trials. Suppose one does an experiment of extracting $n$ balls of $k$ different colors from a bag, replacing the extracted balls after each draw. Balls of the same color are equivalent. Denote the number of extracted balls of color $i$ $(i=1, \ldots, k)$ as $X_i$, and denote by $p_i$ the probability that a given extraction will be of color $i$. The probability mass function of this multinomial distribution is:

$\begin{aligned} f\left(x_1, \ldots, x_k ; n, p_1, \ldots, p_k\right) & =\operatorname{Pr}\left(X_1=x_1 \text { and } \ldots \text { and } X_k=x_k\right) \\ & = \begin{cases}\frac{n !}{x_{1} ! \cdots x_{k} !} p_1^{x_1} \times \cdots \times p_k^{x_k}, & \text { when } \sum_{i=1}^k x_i=n \\ 0 & \text { otherwise, }\end{cases} \end{aligned}$

for non-negative integers $x_1, \ldots, x_k$.

- Conjugate prior is the Dirichlet distribution
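As a quick sanity check of the PMF above, the following Python sketch (counts and probabilities are hypothetical) evaluates the formula directly and compares it with an empirical frequency from NumPy's multinomial sampler.

```python
# Minimal sketch (hypothetical counts and probabilities) of the multinomial PMF,
# checked against empirical frequencies from NumPy sampling.
import math
import numpy as np

def multinomial_pmf(x, p):
    """PMF f(x_1,...,x_k; n, p_1,...,p_k) with n = sum(x)."""
    n = sum(x)
    coeff = math.factorial(n)
    for xi in x:
        coeff //= math.factorial(xi)       # n! / (x_1! ... x_k!)
    prob = coeff
    for xi, pi in zip(x, p):
        prob *= pi ** xi                   # multiply by p_i^{x_i}
    return prob

p = [0.2, 0.3, 0.5]    # probabilities of the k = 3 colors (made up)
x = [1, 2, 2]          # observed counts, n = 5 draws (made up)

exact = multinomial_pmf(x, p)

# Monte Carlo check: fraction of simulated experiments matching x exactly.
rng = np.random.default_rng(0)
draws = rng.multinomial(5, p, size=100_000)
empirical = np.mean(np.all(draws == x, axis=1))

print(exact, empirical)   # the two values should be close
```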
#### Poisson

The Poisson distribution models the number of events occurring in a fixed interval of time or space when events happen at a constant average rate and independently of the time since the last event.

- $P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}$
- $k$ is the number of events and $\lambda$ is the average rate at which events occur in the given interval

### Continuous distributions

#### Beta

The Beta distribution is a family of continuous probability distributions defined on the interval $[0,1]$ (or $(0,1)$) in terms of two positive parameters, denoted alpha ($\alpha$) and beta ($\beta$).

- $p(x ; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha-1}(1-x)^{\beta-1}$ where $\mathrm{B}(\alpha, \beta)=\frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}$
- $\alpha, \beta > 0$
- $0 \leq x \leq 1$
- $\mathrm{E}[X]=\frac{\alpha}{\alpha+\beta}$
- $\operatorname{var}[X]=\frac{\alpha \beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
- Conjugate prior to the Bernoulli and Binomial distributions

#### Dirichlet Distribution

The Dirichlet distribution is the multivariate generalization of the Beta distribution. It is a probability distribution describing probabilities of outcomes. Instead of describing the probability of one of the two outcomes of a Bernoulli trial, as the Beta distribution does, it describes the probabilities of $K$ outcomes. The Beta distribution is the special case of the Dirichlet distribution with $K=2$.

$f\left(x_1, \ldots, x_K ; \alpha_1, \ldots, \alpha_K\right)=\frac{1}{\mathrm{~B}(\boldsymbol{\alpha})} \prod_{i=1}^K x_i^{\alpha_i-1}$

The normalizing constant is the multivariate beta function, which can be expressed in terms of the gamma function:

$\mathrm{B}(\boldsymbol{\alpha})=\frac{\prod_{i=1}^K \Gamma\left(\alpha_i\right)}{\Gamma\left(\sum_{i=1}^K \alpha_i\right)}, \quad \boldsymbol{\alpha}=\left(\alpha_1, \ldots, \alpha_K\right)$

#### Gamma

The Gamma distribution is a general type of statistical distribution that is related to the Beta distribution and arises naturally in processes for which the waiting times between Poisson-distributed events are relevant. Gamma distributions have two free parameters, labeled $\alpha$ and $\beta$.

$f(x ; \alpha, \beta)=\frac{x^{\alpha-1} e^{-\beta x} \beta^\alpha}{\Gamma(\alpha)} \quad \text { for } x>0, \quad \alpha, \beta>0,$

where $\Gamma(\alpha)$ is the gamma function. For positive integers $\alpha$, $\Gamma(\alpha)=(\alpha-1)!$

#### Other

- [[Gaussian Distribution]]
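To illustrate the conjugacy claims above (a Beta prior combined with Bernoulli/Binomial data gives a Beta posterior), here is a minimal sketch assuming SciPy is available; the prior hyperparameters and the data are made up for illustration.

```python
# Minimal sketch (hypothetical prior and data) of Beta-Binomial conjugacy:
# a Beta(alpha, beta) prior with k successes in n trials gives the posterior
# Beta(alpha + k, beta + n - k). We verify this against a grid computation.
import numpy as np
from scipy.stats import beta, binom

alpha_0, beta_0 = 2.0, 3.0   # prior hyperparameters (made up)
n, k = 10, 7                 # n Bernoulli trials, k successes (made up)

# Closed-form conjugate posterior.
posterior = beta(alpha_0 + k, beta_0 + n - k)

# Grid check: posterior is proportional to prior * likelihood, normalized numerically.
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)
unnorm = beta.pdf(theta, alpha_0, beta_0) * binom.pmf(k, n, theta)
grid_posterior = unnorm / unnorm.sum()       # normalized weights on the grid

print(posterior.mean())                      # analytic posterior mean
print(np.sum(theta * grid_posterior))        # grid estimate (should be close)
```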