# Mixture of Experts

The mixture of experts model extends simple model combination by allowing the mixing coefficients themselves to be functions of the input variable, so that

$ p(\mathbf{t} \mid \mathbf{x})=\sum_{k=1}^{K} \pi_{k}(\mathbf{x}) p_{k}(\mathbf{t} \mid \mathbf{x}) $

This is known as the _mixture of experts_ model, in which the mixing coefficients $\pi_{k}(\mathbf{x})$ are known as the gating functions and the individual component densities $p_{k}(\mathbf{t} \mid \mathbf{x})$ are called experts. The idea is that the gating functions determine which experts dominate in which region of input space. The gating functions must satisfy the usual constraints $0 \leqslant \pi_{k}(\mathbf{x}) \leqslant 1$ and $\sum_{k} \pi_{k}(\mathbf{x})=1$, so they can be modeled by linear softmax models:

$ \pi_{n k}=\pi_{k}\left(\boldsymbol{x}_{n}\right)=\frac{\exp \left(\boldsymbol{\phi}_{k}^{T} \boldsymbol{x}_{n}\right)}{\sum_{j} \exp \left(\boldsymbol{\phi}_{j}^{T} \boldsymbol{x}_{n}\right)} $

Let $\boldsymbol{\Theta} \in \mathbb{R}^{D \times K}$ be a matrix whose columns are the $D$-dimensional parameter vectors of the experts, and let $\boldsymbol{\Phi} \in \mathbb{R}^{D \times K}$ be a matrix containing all of the parameters of the gating (routing) function. Introducing a latent variable $z_{n} \in \{1, \ldots, K\}$ that indicates which expert generated target $y_{n}$, and assuming i.i.d. data, the likelihood of the entire dataset is

$ p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\Theta}, \boldsymbol{\Phi})=\prod_{n=1}^{N} \sum_{k=1}^{K} p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=k, \boldsymbol{\theta}_{k}\right) $

and its log is

$ \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\Theta}, \boldsymbol{\Phi})=\sum_{n=1}^{N} \log \left(\sum_{k=1}^{K} p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=k, \boldsymbol{\theta}_{k}\right)\right) $

Let $\mathcal{L}=\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\Theta}, \boldsymbol{\Phi})$.
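To make the formulation concrete, here is a minimal sketch of the model density, assuming linear-Gaussian experts $p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=k, \boldsymbol{\theta}_{k}\right)=\mathcal{N}\left(y_{n} \mid \boldsymbol{\theta}_{k}^{T} \boldsymbol{x}_{n}, \sigma^{2}\right)$ with a fixed noise scale $\sigma$; the Gaussian choice and the function names (`gating_probs`, `expert_densities`, `moe_log_likelihood`) are illustrative assumptions, not part of the derivation above.

```python
import numpy as np
from scipy.stats import norm


def gating_probs(X, Phi):
    """Linear softmax gating: pi_{nk} = exp(phi_k^T x_n) / sum_j exp(phi_j^T x_n)."""
    logits = X @ Phi                                   # (N, K)
    logits -= logits.max(axis=1, keepdims=True)        # subtract row max for numerical stability
    unnorm = np.exp(logits)
    return unnorm / unnorm.sum(axis=1, keepdims=True)  # (N, K), rows sum to 1


def expert_densities(X, y, Theta, sigma=1.0):
    """p(y_n | x_n, z_n = k, theta_k), assumed here to be N(y_n | theta_k^T x_n, sigma^2)."""
    means = X @ Theta                                  # (N, K) per-expert predictions
    return norm.pdf(y[:, None], loc=means, scale=sigma)


def moe_log_likelihood(X, y, Theta, Phi, sigma=1.0):
    """log p(y | X, Theta, Phi) = sum_n log sum_k pi_{nk} p(y_n | x_n, z_n = k, theta_k)."""
    pi = gating_probs(X, Phi)
    px = expert_densities(X, y, Theta, sigma)
    return np.log((pi * px).sum(axis=1)).sum()
```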
Then,

$ \begin{array}{l} \frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}_{i}}=\sum_{n=1}^{N} \frac{p\left(z_{n}=i \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right)}{\sum_{j=1}^{K} p\left(z_{n}=j \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=j, \boldsymbol{\theta}_{j}\right)} \frac{\partial p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=i, \boldsymbol{\theta}_{i}\right)}{\partial \boldsymbol{\theta}_{i}} \\ =\sum_{n=1}^{N} \frac{p\left(z_{n}=i \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=i, \boldsymbol{\theta}_{i}\right)}{\sum_{j=1}^{K} p\left(z_{n}=j \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=j, \boldsymbol{\theta}_{j}\right)} \frac{\partial \log p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=i, \boldsymbol{\theta}_{i}\right)}{\partial \boldsymbol{\theta}_{i}} \\ =\sum_{n=1}^{N} r_{n i} \frac{\partial \log p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=i, \boldsymbol{\theta}_{i}\right)}{\partial \boldsymbol{\theta}_{i}} \end{array} $

where

$ r_{n k}=\frac{p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=k, \boldsymbol{\theta}_{k}\right)}{\sum_{j=1}^{K} p\left(z_{n}=j \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=j, \boldsymbol{\theta}_{j}\right)} $

is the responsibility of expert $k$ for data point $n$. Similarly, for the gating parameters,

$ \begin{array}{l} \frac{\partial \mathcal{L}}{\partial \boldsymbol{\phi}_{i}}=\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=k, \boldsymbol{\theta}_{k}\right)}{\sum_{j=1}^{K} p\left(z_{n}=j \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=j, \boldsymbol{\theta}_{j}\right)} \frac{\partial p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right)}{\partial \boldsymbol{\phi}_{i}} \\ =\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=k, \boldsymbol{\theta}_{k}\right) p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right)}{\sum_{j=1}^{K} p\left(z_{n}=j \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=j, \boldsymbol{\theta}_{j}\right)} \frac{\partial \log p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right)}{\partial \boldsymbol{\phi}_{i}} \\ =\sum_{n=1}^{N} \sum_{k=1}^{K} r_{n k} \frac{\partial \log p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right)}{\partial \boldsymbol{\phi}_{i}} \end{array} $

If the gating and expert models are both linear, the whole model can be fitted efficiently using the [[Expectation Maximization]] algorithm. Alternatively, the gradients above can be used directly for gradient ascent on the log likelihood (a code sketch of this loop is given at the end of this note):

Randomly initialize $\boldsymbol{\Theta}, \boldsymbol{\Phi}$. Choose a learning rate $\eta$ and tolerance $\epsilon$. Set $\mathcal{L}_{0}=\infty$. For $i$ in range(1, max_iter):

1. $\boldsymbol{\theta}_{k}=\boldsymbol{\theta}_{k}+\eta\left(\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}_{k}}\right)^{T}$ for $k=1, \ldots, K$
2. $\boldsymbol{\phi}_{k}=\boldsymbol{\phi}_{k}+\eta\left(\frac{\partial \mathcal{L}}{\partial \boldsymbol{\phi}_{k}}\right)^{T}$ for $k=1, \ldots, K$
3. Compute $\mathcal{L}_{i}$. If $\left|\mathcal{L}_{i}-\mathcal{L}_{i-1}\right| \leq \epsilon$: break

Obtain the optimized parameters $\boldsymbol{\Theta}^{*}, \boldsymbol{\Phi}^{*}$.

The _mixture of experts_ model presented here is still limited by its use of linear models for the gating and expert functions. A much more powerful model follows by using a multilevel gating function, giving the _hierarchical mixture of experts_ (HME) model, which can be thought of as a probabilistic version of Decision Trees.
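As a concrete illustration of the gradient-ascent loop above, here is a minimal sketch that reuses the `gating_probs`, `expert_densities`, and `moe_log_likelihood` helpers from the earlier block, again assuming linear-Gaussian experts with a fixed noise scale; the names `responsibilities` and `fit_moe` and the default hyperparameters are illustrative assumptions. For this likelihood the per-example gradients reduce to $\partial \log p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=k, \boldsymbol{\theta}_{k}\right) / \partial \boldsymbol{\theta}_{k}=\left(y_{n}-\boldsymbol{\theta}_{k}^{T} \boldsymbol{x}_{n}\right) \boldsymbol{x}_{n} / \sigma^{2}$ and $\partial \log \pi_{n k} / \partial \boldsymbol{\phi}_{i}=\left(\delta_{k i}-\pi_{n i}\right) \boldsymbol{x}_{n}$.

```python
def responsibilities(X, y, Theta, Phi, sigma=1.0):
    """r_{nk} = pi_{nk} p(y_n | x_n, theta_k) / sum_j pi_{nj} p(y_n | x_n, theta_j)."""
    joint = gating_probs(X, Phi) * expert_densities(X, y, Theta, sigma)
    return joint / joint.sum(axis=1, keepdims=True)


def fit_moe(X, y, K, lr=1e-3, tol=1e-6, max_iter=1000, sigma=1.0, seed=0):
    """Gradient ascent on the mixture-of-experts log likelihood (linear-Gaussian experts)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Theta = rng.normal(scale=0.1, size=(D, K))   # expert parameters, one column per expert
    Phi = rng.normal(scale=0.1, size=(D, K))     # gating parameters, one column per expert
    prev = np.inf                                # plays the role of L_0 above
    for _ in range(max_iter):
        pi = gating_probs(X, Phi)                        # (N, K) gating probabilities pi_{nk}
        r = responsibilities(X, y, Theta, Phi, sigma)    # (N, K) responsibilities r_{nk}
        resid = y[:, None] - X @ Theta                   # (N, K) residual of each expert
        # dL/dtheta_k = sum_n r_{nk} (y_n - theta_k^T x_n) x_n / sigma^2
        Theta += lr * (X.T @ (r * resid)) / sigma**2
        # dL/dphi_i = sum_n sum_k r_{nk} (delta_{ki} - pi_{ni}) x_n = sum_n (r_{ni} - pi_{ni}) x_n
        Phi += lr * (X.T @ (r - pi))
        cur = moe_log_likelihood(X, y, Theta, Phi, sigma)
        if abs(cur - prev) <= tol:                       # convergence check on the log likelihood
            break
        prev = cur
    return Theta, Phi
```

For example, `Theta_star, Phi_star = fit_moe(X, y, K=3)` would fit a three-expert model to inputs `X` of shape `(N, D)` and targets `y` of shape `(N,)`.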