# Logistic Regression

In probabilistic discriminative models, we directly model the posterior class probabilities instead of first modelling class-conditional and prior probabilities as in [[Probabilistic Generative Models]].

Assume a dataset $\mathbf{X}=\left(\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right)^{T}$ with binary targets $\mathbf{t}=\left(t_{1}, \ldots, t_{N}\right)^{T}$. In probabilistic discriminative linear models, the posteriors are modeled as a nonlinear function applied to a linear function of $\boldsymbol{\phi}$:

$
p\left(\mathcal{C}_{k} \mid \boldsymbol{\phi}, \mathbf{w}\right)=f\left(\mathbf{w}^{T} \boldsymbol{\phi}\right)
$

We know from [[Probabilistic Generative Models]] that the posterior can be written as a logistic sigmoid acting on a linear function of the feature vector, so that

$
p\left(\mathcal{C}_{1} \mid \boldsymbol{\phi}\right)=y(\boldsymbol{\phi})=\sigma\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}\right)
$

$
p\left(\mathcal{C}_{2} \mid \boldsymbol{\phi}, \mathbf{w}\right)=1-\sigma\left(\mathbf{w}^{T} \boldsymbol{\phi}\right)
$

When directly modeling the posterior, the number of parameters scales linearly with $M$, where $\mathbf{w}, \boldsymbol{\phi} \in \mathbb{R}^M$. By contrast, had we fitted Gaussian class-conditional densities by maximum likelihood, we would have needed $2M$ parameters for the means and $M(M+1)/2$ parameters for the shared covariance matrix. Together with the class prior $p(\mathcal{C}_1)$ this gives a total of $M(M+5)/2+1$ parameters, which grows quadratically with $M$.

For **K classes**, generalized logistic regression is given by

$
p\left(\mathcal{C}_{k} \mid \phi(\mathbf{x}), \mathbf{w}_{1}, \ldots, \mathbf{w}_{K}\right)=y_{k}(\phi)=\frac{\exp \left(a_{k}\right)}{\sum_{j=1}^{K} \exp \left(a_{j}\right)}
$

where $a_{k}=\mathbf{w}_{k}^{T} \boldsymbol{\phi}$. Assuming i.i.d. data with 1-of-$K$ target coding $t_{nk}$, the likelihood function can be written as

$
\begin{aligned}
p\left(\mathbf{T} \mid \boldsymbol{\Phi}, \mathbf{w}_{1}, \ldots, \mathbf{w}_{K}\right) &=\prod_{n=1}^{N} \prod_{k=1}^{K} y_{k}\left(\boldsymbol{\phi}_{n}\right)^{t_{n k}} \\
\log p\left(\mathbf{T} \mid \boldsymbol{\Phi}, \mathbf{w}_{1}, \ldots, \mathbf{w}_{K}\right) &=\sum_{n=1}^{N} \sum_{k=1}^{K} t_{n k} \log y_{k}\left(\boldsymbol{\phi}_{n}\right)
\end{aligned}
$

If we place a Gaussian prior on the weights (which makes the MAP estimate equivalent to regularized logistic regression),

$
\begin{array}{c}
p\left(\mathbf{w}_{1}, \ldots, \mathbf{w}_{K} \mid \alpha\right)=\prod_{k=1}^{K} \mathcal{N}\left(\mathbf{w}_{k} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)=\prod_{k=1}^{K} \frac{1}{(2 \pi)^{M / 2} \sqrt{\left|\alpha^{-1} \mathbf{I}\right|}} \exp \left(-\frac{\alpha}{2} \mathbf{w}_{k}^{T} \mathbf{w}_{k}\right)=\left(\frac{\alpha}{2 \pi}\right)^{\frac{K M}{2}} \exp \left(-\frac{\alpha}{2} \sum_{k=1}^{K} \mathbf{w}_{k}^{T} \mathbf{w}_{k}\right) \\
\log p\left(\mathbf{w}_{1}, \ldots, \mathbf{w}_{K} \mid \alpha\right)=\frac{K M}{2} \log \frac{\alpha}{2 \pi}-\frac{\alpha}{2} \sum_{k=1}^{K} \mathbf{w}_{k}^{T} \mathbf{w}_{k}
\end{array}
$

The gradient of the log-likelihood is (using the derivative of the softmax function, $\frac{\partial y_{k}}{\partial a_{j}}=y_{k}\left(\mathbb{I}_{k j}-y_{j}\right)$)

$
\begin{aligned}
\frac{\partial \log p\left(\mathbf{T} \mid \boldsymbol{\Phi}, \mathbf{w}_{1}, \ldots, \mathbf{w}_{K}\right)}{\partial \mathbf{w}_{j}} &=\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{t_{n k}}{y_{k}\left(\phi_{n}\right)} \frac{\partial y_{k}\left(\phi_{n}\right)}{\partial \mathbf{w}_{j}} \\
&=\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{t_{n k}}{y_{k}\left(\phi_{n}\right)} y_{k}\left(\phi_{n}\right)\left(\mathbb{I}_{k j}-y_{j}\left(\phi_{n}\right)\right) \phi_{n}^{T} \\
&=\sum_{n=1}^{N} \sum_{k=1}^{K} t_{n k}\left(\mathbb{I}_{k j}-y_{j}\left(\phi_{n}\right)\right) \phi_{n}^{T} \\
&=\sum_{n=1}^{N} \sum_{k=1}^{K} t_{n k} \mathbb{I}_{k j} \phi_{n}^{T}-\sum_{n=1}^{N} \sum_{k=1}^{K} t_{n k} y_{j}\left(\phi_{n}\right) \phi_{n}^{T} \\
&=\sum_{n=1}^{N} \phi_{n}^{T} \sum_{k=1}^{K} t_{n k} \mathbb{I}_{k j}-\sum_{n=1}^{N} y_{j}\left(\phi_{n}\right) \phi_{n}^{T} \sum_{k=1}^{K} t_{n k} \\
&=\sum_{n=1}^{N}\left(t_{n j}-y_{j}\left(\phi_{n}\right)\right) \phi_{n}^{T}
\end{aligned}
$

where the last step uses $\sum_{k} t_{n k} \mathbb{I}_{k j}=t_{n j}$ and $\sum_{k} t_{n k}=1$ (one-hot targets).

The gradient of the log-prior is

$
\frac{\partial \log p\left(\mathbf{w}_{1}, \ldots, \mathbf{w}_{K} \mid \alpha\right)}{\partial \mathbf{w}_{j}}=-\alpha \mathbf{w}_{j}^{T}
$

and the gradient of the log-posterior is

$
\frac{\partial \log p\left(\mathbf{w}_{1}, \ldots, \mathbf{w}_{K} \mid \mathbf{T}, \boldsymbol{\Phi}, \alpha\right)}{\partial \mathbf{w}_{j}}=\sum_{n=1}^{N}\left(t_{n j}-y_{j}\left(\boldsymbol{\phi}_{n}\right)\right) \boldsymbol{\phi}_{n}^{T}-\alpha \mathbf{w}_{j}^{T}
$
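As an illustrative aside, here is a minimal NumPy sketch that evaluates this log-posterior gradient for all classes at once; the names (`softmax`, `log_posterior_grad`, `Phi`, `T_onehot`, `alpha` as the prior precision) and the array shapes are my own assumptions, not from the text.

```python
import numpy as np

def softmax(A):
    # Row-wise softmax; subtract the row max for numerical stability.
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def log_posterior_grad(W, Phi, T_onehot, alpha):
    """Gradient of the log-posterior w.r.t. all w_j, stacked as columns of an (M, K) matrix.

    W        : (M, K) weight vectors, one column per class
    Phi      : (N, M) design matrix whose rows are the feature vectors phi_n
    T_onehot : (N, K) one-hot targets with entries t_nk
    alpha    : precision of the Gaussian prior
    """
    Y = softmax(Phi @ W)                       # y_k(phi_n) for every n and k
    return Phi.T @ (T_onehot - Y) - alpha * W  # column j: sum_n (t_nj - y_j(phi_n)) phi_n - alpha w_j
```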
## Gradient based iterative optimization

Now let's define an error function for the **binary class** case by taking the negative log of the likelihood, which gives us the [[Cross entropy]] error function in the form

$
E(\mathbf{w})=-\ln p(\mathbf{t} \mid \mathbf{w})=-\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}
$

Taking the gradient of the error function with respect to $\mathbf{w}$, we obtain

$
\nabla E(\mathbf{w})=\sum_{n=1}^{N}\left(y_{n}-t_{n}\right) \boldsymbol{\phi}_{n}
$

In generalized linear models, when we use an activation function with the property $f(\psi(y)) = y$, i.e. the activation function is the inverse of the (canonical) link function, as for the sigmoid and softmax, the gradient of the error reduces to the simple form

$
\nabla E(\mathbf{w})=\frac{1}{s} \sum_{n=1}^{N}\left\{y_{n}-t_{n}\right\} \phi_{n}
$

See Bishop 4.3.6 for more on canonical link functions.

The optimal weights can be obtained iteratively using [[Stochastic Gradient Descent]] or with iteratively reweighted least squares. Logistic regression doesn't suffer from the outlier problem that [[Least squares for classification]] does.
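As a sanity check on the gradient formula above, the following sketch compares the analytic gradient $\boldsymbol{\Phi}^{T}(\mathbf{y}-\mathbf{t})$ with a central finite-difference estimate on random data; all names, array sizes, and the step $\varepsilon$ are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
Phi = rng.normal(size=(N, M))                  # random design matrix
t = rng.integers(0, 2, size=N).astype(float)   # random binary targets
w = rng.normal(size=M)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w):
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def analytic_grad(w):
    return Phi.T @ (sigmoid(Phi @ w) - t)      # sum_n (y_n - t_n) phi_n

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric_grad = np.array([
    (cross_entropy(w + eps * e) - cross_entropy(w - eps * e)) / (2 * eps)
    for e in np.eye(M)
])
print(np.max(np.abs(numeric_grad - analytic_grad(w))))  # should be close to zero (~1e-8)
```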
## Optimization with SGD

$y_n = \sigma(\mathbf{w}^T\phi_n)$ is nonlinear in $\mathbf{w}$, so no closed-form solution exists. But the error function is convex, so [[Stochastic Gradient Descent]] is guaranteed to converge (with a suitable learning rate).

We have the cross entropy error function

$
E(\mathbf{w})=-\ln p(\mathbf{t} \mid \mathbf{w})=-\sum_{n=1}^{N}\left\{t_{n} \ln y_{n}+\left(1-t_{n}\right) \ln \left(1-y_{n}\right)\right\}
$

where $y_n = y(\boldsymbol{\phi}_n)=\sigma\left(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n\right)$. The update rule given a random data point $(\mathbf{x}_n,t_n)$ is

$
\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}-\eta \nabla E_{n}\left(\mathbf{w}^{(\tau)}\right)^{T}
$

The gradient is given as

$
\nabla E_{n}(\mathbf{w})=\left(\frac{\partial E_{n}(\mathbf{w})}{\partial w_{0}}, \ldots, \frac{\partial E_{n}(\mathbf{w})}{\partial w_{M-1}}\right)
$

where

$
\frac{\partial E_{n}(\mathbf{w})}{\partial w_{j}}=\frac{\partial E_{n}(\mathbf{w})}{\partial y_{n}} \frac{\partial y_{n}}{\partial w_{j}}=\left(-\frac{t_{n}}{y_{n}}+\frac{1-t_{n}}{1-y_{n}}\right) \cdot \frac{\partial y_{n}}{\partial w_{j}}
$

and

$
\begin{align}
\frac{\partial y_{n}}{\partial w_{j}}&=\frac{\partial}{\partial w_{j}} \sigma\left(\mathbf{w}^{T} \phi_{n}\right)\\
&= \sigma\left(\mathbf{w}^{T} \phi_{n}\right)\left(1-\sigma\left(\mathbf{w}^{T} \phi_{n}\right)\right) \cdot \frac{\partial \mathbf{w}^T\phi_n}{\partial w_j} \\
&= y_n(1-y_n)\phi_{nj}
\end{align}
$

Therefore we have

$
\begin{align}
\frac{\partial E_{n}(\mathbf{w})}{\partial w_{j}}&=-\frac{t_{n}}{y_{n}} \frac{\partial y_{n}}{\partial w_{j}}+\frac{1-t_{n}}{1-y_{n}} \frac{\partial y_{n}}{\partial w_{j}} \\
&= -t_{n}\left(1-y_{n}\right)\phi_{nj} + \left(1-t_{n}\right)y_{n}\phi_{nj} \\
&= (y_n - t_n)\phi_{nj}
\end{align}
$

Therefore the final update rule is given as

$
\mathbf{w}^{(\tau+1)}=\mathbf{w}^{(\tau)}-\eta\left(y_{n}^{(\tau)}-t_{n}\right) \phi\left(\mathbf{x}_{n}\right)
$

From this update rule, we can see that logistic regression acts like a soft version of the [[Perceptron]].
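The derivation above translates directly into a short training loop. Below is a minimal sketch, assuming a NumPy design matrix `Phi` (rows $\boldsymbol{\phi}_n^T$) and 0/1 targets `t`; the learning rate, epoch count, and function names are illustrative choices, not from the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_logistic_regression(Phi, t, eta=0.1, n_epochs=100, seed=0):
    """Fit w with the per-sample update w <- w - eta * (y_n - t_n) * phi_n."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_epochs):
        for n in rng.permutation(N):      # visit the data points in random order
            y_n = sigmoid(w @ Phi[n])     # y_n = sigma(w^T phi_n)
            w = w - eta * (y_n - t[n]) * Phi[n]
    return w
```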
## Newton-Raphson Iterative Optimization (IRLS)

Newton-Raphson uses second-order derivatives and doesn't require a step size (learning rate). This is because we use a local quadratic approximation of the log-likelihood function, which has a single optimum that is taken as the next estimate. This method takes into account the curvature of the loss surface instead of depending on the local linear approximation given by the gradient.

Given the old estimate $\mathbf{w}^{(\tau - 1)}$, approximate $E(\mathbf{w})$ with a second-order [[Taylor Expansion]] around $\mathbf{w}^{(\tau-1)}$,

$
E(\mathbf{w}) \approx \tilde{E}\left(\mathbf{w}^{(\tau-1)}+\Delta \mathbf{w}\right)=E\left(\mathbf{w}^{(\tau-1)}\right)+(\Delta \mathbf{w})^{T} \nabla^{T} E\left(\mathbf{w}^{(\tau-1)}\right)+\frac{1}{2}(\Delta \mathbf{w})^{T} \mathbf{H} \Delta \mathbf{w}
$

Now choose $\Delta \mathbf{w}$ such that the next estimate $\mathbf{w}^{(\tau)}=\mathbf{w}^{(\tau-1)}+\Delta \mathbf{w}$ minimizes $\tilde{E}(\mathbf{w})$,

$
\frac{\partial}{\partial \Delta \mathbf{w}} \tilde{E}\left(\mathbf{w}^{(\tau-1)}+\Delta \mathbf{w}\right)= \nabla E+(\Delta \mathbf{w})^{T} \mathbf{H}
$

Setting this to zero, we get $\Delta \mathbf{w}=-\mathbf{H}^{-1} \nabla^{T} E$. This gives the final update rule

$
\mathbf{w}^{(\tau)}=\mathbf{w}^{(\tau-1)}-\mathbf{H}^{-1} \nabla^{T} E\left(\mathbf{w}^{(\tau-1)}\right)
$

where the gradient and Hessian are given as ($\boldsymbol{\Phi} \in \mathbb{R}^{N\times M}$)

$
\nabla E(\mathbf{w})^{T}=\sum_{n=1}^{N}\left(y_{n}-t_{n}\right) \boldsymbol{\phi}_{n} = \boldsymbol{\Phi}^{T}(\mathbf{y}-\mathbf{t})
$

$
\mathbf{H}=\nabla \nabla E(\mathbf{w})=\sum_{n=1}^{N} y_{n}\left(1-y_{n}\right) \boldsymbol{\phi}_{n} \boldsymbol{\phi}_{n}^{\mathrm{T}}=\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi}
$

Note that we have introduced a diagonal matrix $\mathbf{R} \in \mathbb{R}^{N \times N}$ with elements $R_{n n}=y_{n}\left(1-y_{n}\right)$.

---

An error function is convex when its Hessian is positive definite, meaning $\forall_{\mathbf{w} \neq \mathbf{0} \in \mathbb{R}^{M}}: \mathbf{w}^{T} \mathbf{H} \mathbf{w}>0$. Here, the Hessian of $E(\mathbf{w})$ is given by $\mathbf{H}=\boldsymbol{\Phi}^{T} \mathbf{R} \boldsymbol{\Phi}$ with $\mathbf{R}=\operatorname{diag}_{N \times N}\left\{y_{n}\left(1-y_{n}\right)\right\}$. To check whether the Hessian is positive definite,

$
\begin{align}
\mathbf{w}^T\mathbf{H}\mathbf{w}&= \mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi}\mathbf{w}\\
&= \mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{R}^{\frac{1}{2}}\mathbf{R}^{\frac{1}{2}}\boldsymbol{\Phi}\mathbf{w} \\
&= (\mathbf{R}^{\frac{1}{2}}\boldsymbol{\Phi}\mathbf{w})^T(\mathbf{R}^{\frac{1}{2}}\boldsymbol{\Phi}\mathbf{w}) \\
&= \|\mathbf{R}^{\frac{1}{2}}\boldsymbol{\Phi}\mathbf{w}\|^2 \gt 0
\end{align}
$

where $\mathbf{R}^\frac{1}{2} = \operatorname{diag}\left(\sqrt{y_{n}\left(1-y_{n}\right)}\right)$, $0<y_{n}<1$, and $\boldsymbol{\Phi}$ is assumed to have full column rank so that $\boldsymbol{\Phi}\mathbf{w} \neq \mathbf{0}$ whenever $\mathbf{w} \neq \mathbf{0}$. Thus the error function is a convex function of $\mathbf{w}$ and hence has a unique minimum.

---

The Newton-Raphson update formula for the logistic regression model then becomes

$
\begin{align}
\mathbf{w}^{(\tau)} &=\mathbf{w}^{(\tau-1)}-\left(\boldsymbol{\Phi}^{T} \mathbf{R} \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{T}(\mathbf{y}-\mathbf{t}) \\
&= \left(\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi}\right)^{-1}\left\{\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi} \mathbf{w}^{(\tau -1)}-\boldsymbol{\Phi}^{\mathrm{T}}(\mathbf{y}-\mathbf{t})\right\} \\
&= \left(\boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{R} \mathbf{z}
\end{align}
$

where $\mathbf{z}$ is an $N$-dimensional vector given by $\mathbf{z}=\boldsymbol{\Phi} \mathbf{w}^{(\tau-1)}-\mathbf{R}^{-1}(\mathbf{y}-\mathbf{t})$. Note that the solution takes the form of a weighted least-squares problem. Because the weighting matrix $\mathbf{R}$ is not constant but depends on the parameter vector $\mathbf{w}$, we must apply this solution iteratively, each time using the new weight vector to compute a new weighting matrix $\mathbf{R}$. Therefore, this algorithm is known as _iterative reweighted least squares_ or IRLS.
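A minimal IRLS sketch under the same assumptions as before (NumPy `Phi` and 0/1 targets `t`); the iteration cap, stopping tolerance, and the clipping of the diagonal of $\mathbf{R}$ away from zero are my own safeguards, not part of the derivation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic_regression(Phi, t, n_iter=20, tol=1e-8):
    """IRLS: repeatedly solve the weighted least-squares problem w = (Phi^T R Phi)^{-1} Phi^T R z."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        r = np.clip(y * (1 - y), 1e-10, None)  # diagonal of R, clipped away from zero
        z = Phi @ w - (y - t) / r              # working response z = Phi w - R^{-1} (y - t)
        w_new = np.linalg.solve(Phi.T @ (Phi * r[:, None]),  # Phi^T R Phi
                                Phi.T @ (r * z))             # Phi^T R z
        if np.max(np.abs(w_new - w)) < tol:    # stop once the update no longer changes w
            return w_new
        w = w_new
    return w
```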
## SGD vs. Newton-Raphson

The benefits of Newton-Raphson over SGD are that:

1. It requires no step size.
2. It has faster convergence.

The benefits of SGD over Newton-Raphson are:

1. Newton-Raphson requires computing and inverting the Hessian in every step, which has cubic complexity in $M$. Therefore, it doesn't scale to large feature spaces or datasets.
2. SGD updates use one data point at a time, so each step is cheap, and with a small enough step size it still converges.

![[Screenshot 2020-09-23 at 3.23.05 PM.jpg]]