# Mixture of Experts
The _mixture of experts_ model extends simple model combination by allowing the mixing coefficients themselves to be functions of the input variable, so that:
$
p(\mathbf{t} \mid \mathbf{x})=\sum_{k=1}^{K} \pi_{k}(\mathbf{x}) p_{k}(\mathbf{t} \mid \mathbf{x})
$
This is known as the _mixture of experts_ model, in which the mixing coefficients $\pi_{k}(\mathbf{x})$ are known as gating functions and the individual component densities $p_{k}(\mathbf{t} \mid \mathbf{x})$ are called experts.
The idea is that the gating functions determine which expert is dominant in each region of the input space. The gating functions must satisfy the usual constraints $0 \leqslant \pi_{k}(\mathbf{x}) \leqslant 1$ and $\sum_{k} \pi_{k}(\mathbf{x})=1$, so they can be modelled by a linear softmax model
$
\pi_{n k}=\frac{\exp \left(\phi_{k}^{T} \boldsymbol{x}_{n}\right)}{\sum_{j} \exp \left(\phi_{j}^{T} \boldsymbol{x}_{n}\right)}
$
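As a minimal sketch of the gating function (assuming NumPy, a design matrix `X` of shape `(N, D)`, and gate parameters `Phi` of shape `(D, K)`; the names are illustrative, not from the source):
```python
import numpy as np

def gating_probs(X, Phi):
    """Softmax gating probabilities pi_{nk} for inputs X (N, D) and gate parameters Phi (D, K)."""
    logits = X @ Phi                                      # (N, K): scores phi_k^T x_n
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max to stabilise the exponentials
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)         # each row sums to 1
```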
Let $\Theta \in \mathbb{R}^{D \times K}$ be a matrix whose columns are the $D$-dimensional parameter vectors $\boldsymbol{\theta}_{k}$ of the experts, and let $\Phi \in \mathbb{R}^{D \times K}$ be a matrix containing the parameters $\boldsymbol{\phi}_{k}$ of the gating (routing) function.
Assuming i.i.d. data, we can write the likelihood of the entire dataset as
$
p(\mathbf{y} \mid \mathbf{X}, \mathbf{\Theta}, \mathbf{\Phi})=\prod_{n=1}^{N} \sum_{k=1}^{K} p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=k, \boldsymbol{\theta}_{k}\right)
$
And its log as
$
\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{\Theta}, \mathbf{\Phi})=\sum_{n=1}^{N} \log \left(\sum_{k=1}^{K} p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=k, \boldsymbol{\theta}_{k}\right)\right)
$
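A minimal sketch of this log-likelihood, assuming Gaussian experts with linear means and a fixed noise scale `sigma` (an assumption for concreteness, not specified in the source), reusing `gating_probs` from the sketch above:
```python
import numpy as np
from scipy.stats import norm

def log_likelihood(X, y, Theta, Phi, sigma=1.0):
    """Log-likelihood of a mixture of linear-Gaussian experts (fixed noise scale sigma is assumed)."""
    pi = gating_probs(X, Phi)                             # (N, K) gating probabilities pi_nk
    means = X @ Theta                                     # (N, K) expert means theta_k^T x_n
    lik = norm.pdf(y[:, None], loc=means, scale=sigma)    # (N, K) expert densities p(y_n | x_n, z_n=k)
    return np.log((pi * lik).sum(axis=1)).sum()           # sum_n log sum_k pi_nk * p_nk
```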
Let $\mathcal{L}=\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{\Theta}, \boldsymbol{\Phi})$. Then,
$
\begin{array}{l}
\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}_{i}}=\sum_{n=1}^{N} \frac{p\left(z_{n}=i \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right)}{\sum_{j=1}^{K} p\left(z_{n}=j \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=j, \boldsymbol{\theta}_{j}\right)} \frac{\partial p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=i, \boldsymbol{\theta}_{i}\right)}{\partial \boldsymbol{\theta}_{i}} \\
=\sum_{n=1}^{N} \frac{p\left(z_{n}=i \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=i, \boldsymbol{\theta}_{i}\right)}{\sum_{j=1}^{K} p\left(z_{n}=j \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=j, \boldsymbol{\theta}_{j}\right)} \frac{\partial \log p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=i, \boldsymbol{\theta}_{i}\right)}{\partial \boldsymbol{\theta}_{i}} \\
=\sum_{n=1}^{N} r_{n i} \frac{\partial \log p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=i, \boldsymbol{\theta}_{i}\right)}{\partial \boldsymbol{\theta}_{i}} \\
\end{array}
$
$
\begin{array}{l}
\frac{\partial \mathcal{L}}{\partial \boldsymbol{\phi}_{i}}=\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=k, \boldsymbol{\theta}_{k}\right)}{\sum_{j=1}^{K} p\left(z_{n}=j \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=j, \boldsymbol{\theta}_{j}\right)} \frac{\partial p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right)}{\partial \boldsymbol{\phi}_{i}} \\
=\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=k, \boldsymbol{\theta}_{k}\right) p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right)}{\sum_{j=1}^{K} p\left(z_{n}=j \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right) p\left(y_{n} \mid \boldsymbol{x}_{n}, z_{n}=j, \boldsymbol{\theta}_{j}\right)} \frac{\partial \log p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right)}{\partial \boldsymbol{\phi}_{i}} \\
=\sum_{n=1}^{N} \sum_{k=1}^{K} r_{n k} \frac{\partial \log p\left(z_{n}=k \mid \boldsymbol{x}_{n}, \boldsymbol{\Phi}\right)}{\partial \boldsymbol{\phi}_{i}}
\end{array}
$
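Continuing the linear-Gaussian sketch above, the responsibilities $r_{nk}$ and both gradients can be computed in closed form: for a Gaussian expert, $\partial \log p(y_n \mid \boldsymbol{x}_n, z_n=i, \boldsymbol{\theta}_i)/\partial \boldsymbol{\theta}_i = (y_n - \boldsymbol{\theta}_i^T \boldsymbol{x}_n)\,\boldsymbol{x}_n/\sigma^2$, and for the softmax gate, $\partial \log p(z_n=k \mid \boldsymbol{x}_n, \boldsymbol{\Phi})/\partial \boldsymbol{\phi}_i = (\delta_{ki} - \pi_{ni})\,\boldsymbol{x}_n$, so the double sum collapses to $\sum_n (r_{ni}-\pi_{ni})\,\boldsymbol{x}_n$. A sketch under these assumptions, reusing `gating_probs` and the SciPy `norm` density from earlier:
```python
def responsibilities(X, y, Theta, Phi, sigma=1.0):
    """Posterior responsibilities r_{nk} = p(z_n = k | x_n, y_n)."""
    pi = gating_probs(X, Phi)
    lik = norm.pdf(y[:, None], loc=X @ Theta, scale=sigma)
    joint = pi * lik                                      # numerator: pi_nk * p(y_n | x_n, z_n=k)
    return joint / joint.sum(axis=1, keepdims=True)

def gradients(X, y, Theta, Phi, sigma=1.0):
    """Gradients of the log-likelihood w.r.t. Theta and Phi (linear-Gaussian experts assumed)."""
    r = responsibilities(X, y, Theta, Phi, sigma)         # (N, K)
    resid = y[:, None] - X @ Theta                        # (N, K) residuals y_n - theta_k^T x_n
    dTheta = X.T @ (r * resid) / sigma**2                 # column i: sum_n r_ni (y_n - theta_i^T x_n) x_n / sigma^2
    dPhi = X.T @ (r - gating_probs(X, Phi))               # column i: sum_n (r_ni - pi_ni) x_n
    return dTheta, dPhi
```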
If the experts are also linear, the whole model can be fitted efficiently using the [[Expectation Maximization]] algorithm. Alternatively, the gradients above can be used to maximize the log-likelihood directly by gradient ascent:
Randomly initialize $\Theta, \Phi$. Choose a learning rate $\eta$ and tolerance $\epsilon$. Set $\mathcal{L}_{0}=-\infty$.
For $i$ in range(1, max_iter):
1. $\boldsymbol{\theta}_{k}=\boldsymbol{\theta}_{k}+\eta\left(\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}_{k}}\right)^{T}$
2. $\phi_{k}=\phi_{k}+\eta\left(\frac{\partial \mathcal{L}}{\partial \phi_{k}}\right)^{T}$
3. Compute $\mathcal{L}_{i}$. If $\left|\mathcal{L}_{i}-\mathcal{L}_{i-1}\right| \leq \epsilon$ : break
Obtain optimized params $\Theta^{*}, \Phi^{*}$
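The whole loop, as a sketch under the same linear-Gaussian assumptions (the helper names `gating_probs`, `log_likelihood`, and `gradients` come from the earlier sketches, and the initialisation scale, defaults, and `sigma` are illustrative choices):
```python
def fit_moe(X, y, K, eta=1e-3, eps=1e-6, max_iter=1000, sigma=1.0, seed=0):
    """Gradient-ascent fitting of a mixture of linear-Gaussian experts with softmax gating."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    Theta = 0.1 * rng.standard_normal((D, K))             # random initialisation of expert parameters
    Phi = 0.1 * rng.standard_normal((D, K))               # random initialisation of gate parameters
    prev = -np.inf
    for _ in range(max_iter):
        dTheta, dPhi = gradients(X, y, Theta, Phi, sigma)
        Theta += eta * dTheta                             # step 1: ascent step on the experts
        Phi += eta * dPhi                                 # step 2: ascent step on the gate
        curr = log_likelihood(X, y, Theta, Phi, sigma)
        if abs(curr - prev) <= eps:                       # step 3: convergence check
            break
        prev = curr
    return Theta, Phi                                     # optimized Theta*, Phi*
```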
_Mixture of experts_ models are still limited by their use of linear models for the gating and expert functions. A much more powerful model follows from using a multilevel gating function, giving the _hierarchical mixture of experts_ (HME) model, which can be thought of as a probabilistic version of decision trees.
---
## References
1. Bishop, _Pattern Recognition and Machine Learning_ (2006), Section 14.5.3
2. Maximum likelihood - http://www2.bcs.rochester.edu/sites/jacobslab/cheat_sheet/mixture_experts.pdf
3. http://publications.idiap.ch/downloads/reports/1997/com97-05.pdf