# Softmax
The softmax function is defined as
$\mathbf{\varsigma}(\boldsymbol{x})_{i}=\frac{\exp x_{i}}{\sum_{j=1}^{n} \exp x_{j}}$
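The definition translates directly into NumPy. A minimal sketch (the max is subtracted before exponentiating — a standard numerical-stability trick that leaves the result unchanged, since softmax is invariant to shifting every input by the same constant):

```python
import numpy as np

def softmax(x):
    # Shift by max(x) for numerical stability; softmax(x) == softmax(x - c)
    # for any constant c, so the output is unaffected.
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0])
s = softmax(x)
# Components are positive and sum to 1.
```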
## Derivative of Softmax
Since $\varsigma(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}^n$, the derivative $\frac{d\varsigma(\mathbf{x})}{d\mathbf{x}}$ is an $n \times n$ Jacobian matrix.
For the case of differentiating $\mathbf{\varsigma}(\boldsymbol{x})_{i}$ with respect to $x_i$,
$
\begin{align}
\frac{d\varsigma(x)_i}{dx_i} &= \frac{d}{dx_i} \frac{\exp x_{i}}{\sum_{j=1}^{n} \exp x_{j}} \\
\end{align}
$
Using the quotient rule $\left(\frac{g}{h}\right)^{\prime}=\frac{g^{\prime} h-g h^{\prime}}{h^{2}}$ with $g = e^{x_i}$ and $h = \sum_{j=1}^{n} e^{x_j}$,
$
\begin{align}
\frac{d\varsigma(x)_i}{dx_i} &= \frac{\sum_{j=1}^{n} e^{x_{j}} \cdot e^{x_i} - e^{x_i} \cdot e^{x_i}}{\left( \sum_{j=1}^{n} e^{x_{j}} \right)^2} \\
&= \frac{e^{x_i}\left( \sum_{j=1}^{n} e^{x_{j}} - e^{x_i} \right)}{ \left( \sum_{j=1}^{n} e^{x_{j}} \right)^2 } \\
&= \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_{j}}} \cdot \frac{\sum_{j=1}^{n} e^{x_{j}} - e^{x_i}}{\sum_{j=1}^{n} e^{x_{j}}} \\
&= \varsigma({x})_i(1 - \varsigma({x})_i)
\end{align}
$
For the case of differentiating $\mathbf{\varsigma}(\boldsymbol{x})_{i}$ with respect to $x_j$ where $j \neq i$ (the summation index is renamed to $k$ so it does not clash with $j$),
$
\begin{align}
\frac{d\varsigma(x)_i}{dx_j} &= \frac{d}{dx_j} \frac{\exp x_{i}}{\sum_{k=1}^{n} \exp x_{k}} \\
\end{align}
$
Again using the quotient rule, and noting that $\frac{d e^{x_i}}{dx_j} = 0$ since $i \neq j$,
$
\begin{align}
\frac{d\varsigma(x)_i}{dx_j} &= \frac{\sum_{k=1}^{n} e^{x_k} \cdot \frac{d e^{x_i}}{dx_j} - e^{x_i} \cdot e^{x_j}}{\left( \sum_{k=1}^{n} e^{x_{k}} \right)^2} \\
&= \frac{- e^{x_i} \cdot e^{x_j}}{\left( \sum_{k=1}^{n} e^{x_{k}} \right)^2}\\
&= - \varsigma(x)_i \, \varsigma(x)_j
\end{align}
$
Therefore,
$
\begin{align}
\frac{d\varsigma(x)_i}{dx_j} &= \begin{cases}
\varsigma({x})_i(1 - \varsigma({x})_i) & \text{when } i=j \\
- \varsigma(x)_i \, \varsigma(x)_j & \text{when } i\neq j
\end{cases}
\end{align}
$
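The piecewise formula can be sanity-checked against a central finite difference. A sketch (`softmax_grad_entry` and `numeric_grad_entry` are illustrative helper names, not from the text):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def softmax_grad_entry(x, i, j):
    # Analytic d softmax(x)_i / d x_j from the piecewise formula above.
    s = softmax(x)
    return s[i] * (1 - s[i]) if i == j else -s[i] * s[j]

def numeric_grad_entry(x, i, j, eps=1e-6):
    # Central-difference approximation of the same entry.
    e = np.zeros_like(x)
    e[j] = eps
    return (softmax(x + e)[i] - softmax(x - e)[i]) / (2 * eps)

x = np.array([0.5, -1.0, 2.0])
errors = [abs(softmax_grad_entry(x, i, j) - numeric_grad_entry(x, i, j))
          for i in range(3) for j in range(3)]
# All nine entries should agree to within finite-difference error.
```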
Writing in matrix form, with $\varsigma(\mathbf{x})$ treated as a column vector,
$
\begin{align}
\frac{d\mathbf{\varsigma(x)}}{d\mathbf{x}} &= \begin{bmatrix}
\varsigma({x})_1(1 - \varsigma({x})_1) & \dots & - \varsigma(x)_1 \, \varsigma(x)_n \\
\vdots & \varsigma({x})_i(1 - \varsigma({x})_i) & \vdots \\
- \varsigma(x)_n \, \varsigma(x)_1 & \dots & \varsigma({x})_n(1 - \varsigma({x})_n)
\end{bmatrix} \\
&=
\begin{bmatrix}
\varsigma(x)_1 & \dots & 0 \\
\vdots & \varsigma(x)_i & \vdots\\
0 & \dots & \varsigma(x)_n\\
\end{bmatrix} -
\begin{bmatrix}
\varsigma({x})_1\varsigma({x})_1 & \dots & \varsigma(x)_1 \, \varsigma(x)_n \\
\vdots & \varsigma(x)_i \, \varsigma(x)_j & \vdots \\
\varsigma(x)_n \, \varsigma(x)_1 & \dots & \varsigma(x)_n \, \varsigma(x)_n
\end{bmatrix} \\
&= \operatorname{diag}(\varsigma(x)_1, \dots, \varsigma(x)_n) - \varsigma(\mathbf{x})\varsigma(\mathbf{x})^T
\end{align}
$
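The matrix form maps directly onto `np.diag` and `np.outer`. A minimal sketch (`softmax_jacobian` is an illustrative name):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def softmax_jacobian(x):
    # diag(softmax(x)) minus the outer product softmax(x) softmax(x)^T.
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

J = softmax_jacobian(np.array([1.0, 2.0, 3.0]))
# The Jacobian is symmetric, and each row sums to zero because the
# softmax outputs always sum to 1 (any perturbation is redistributed).
```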