# Softmax

The softmax function is defined as

$\varsigma(\mathbf{x})_{i}=\frac{\exp x_{i}}{\sum_{j=1}^{n} \exp x_{j}}$

## Derivative of Softmax

Since $\varsigma(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}^n$, its derivative is a Jacobian matrix: $\frac{d}{d\mathbf{x}} \varsigma(\mathbf{x}) \in \mathbb{R}^{n \times n}$.

For the case of differentiating $\varsigma(\mathbf{x})_{i}$ with respect to $x_i$,

$
\begin{align}
\frac{d\varsigma(x)_i}{dx_i} &= \frac{d}{dx_i} \frac{\exp x_{i}}{\sum_{j=1}^{n} \exp x_{j}}
\end{align}
$

Using the quotient rule $f^{\prime}(x)=\frac{g^{\prime}(x) h(x)-g(x) h^{\prime}(x)}{[h(x)]^{2}}$,

$
\begin{align}
\frac{d\varsigma(x)_i}{dx_i} &= \frac{\left(\sum_{j=1}^{n} e^{x_{j}}\right) e^{x_i} - e^{x_i} \, e^{x_i}}{\left( \sum_{j=1}^{n} e^{x_{j}} \right)^2} \\
&= \frac{e^{x_i}\left( \sum_{j=1}^{n} e^{x_{j}} - e^{x_i} \right)}{ \left( \sum_{j=1}^{n} e^{x_{j}} \right)^2 } \\
&= \varsigma({x})_i(1 - \varsigma({x})_i)
\end{align}
$

For the case of differentiating $\varsigma(\mathbf{x})_{i}$ with respect to $x_j$ where $j \neq i$ (renaming the summation index to $k$ to avoid a clash with $j$),

$
\begin{align}
\frac{d\varsigma(x)_i}{dx_j} &= \frac{d}{dx_j} \frac{\exp x_{i}}{\sum_{k=1}^{n} \exp x_{k}}
\end{align}
$

Again using the quotient rule, and noting that $\frac{d e^{x_i}}{dx_j} = 0$ because $i \neq j$,

$
\begin{align}
\frac{d\varsigma(x)_i}{dx_j} &= \frac{\left(\sum_{k=1}^{n} e^{x_k}\right) \cdot 0 - e^{x_i} \, e^{x_j}}{\left( \sum_{k=1}^{n} e^{x_{k}} \right)^2} \\
&= \frac{- e^{x_i} \, e^{x_j}}{\left( \sum_{k=1}^{n} e^{x_{k}} \right)^2} \\
&= - \varsigma(x)_i \, \varsigma(x)_j
\end{align}
$

Therefore,

$
\begin{align}
\frac{d\varsigma(x)_i}{dx_j} &=
\begin{cases}
\varsigma({x})_i(1 - \varsigma({x})_i) & \text{when } i=j \\
- \varsigma(x)_i \, \varsigma(x)_j & \text{when } i\neq j
\end{cases}
\end{align}
$

Writing in matrix form (treating $\varsigma(\mathbf{x})$ as a column vector),

$
\begin{align}
\frac{d\mathbf{\varsigma(x)}}{d\mathbf{x}} &=
\begin{bmatrix}
\varsigma({x})_1(1 - \varsigma({x})_1) & \dots & - \varsigma(x)_1 \, \varsigma(x)_n \\
\vdots & \ddots & \vdots \\
- \varsigma(x)_n \, \varsigma(x)_1 & \dots & \varsigma({x})_n(1 - \varsigma({x})_n)
\end{bmatrix} \\
&=
\begin{bmatrix}
\varsigma(x)_1 & \dots & 0 \\
\vdots & \ddots & \vdots \\
0 & \dots & \varsigma(x)_n
\end{bmatrix}
-
\begin{bmatrix}
\varsigma({x})_1 \varsigma({x})_1 & \dots & \varsigma(x)_1 \, \varsigma(x)_n \\
\vdots & \ddots & \vdots \\
\varsigma(x)_n \, \varsigma(x)_1 & \dots & \varsigma(x)_n \, \varsigma(x)_n
\end{bmatrix} \\
&= \operatorname{diag}(\varsigma(x)_1, \dots, \varsigma(x)_n) - \varsigma(\mathbf{x})\varsigma(\mathbf{x})^T
\end{align}
$
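The closed form $\operatorname{diag}(\varsigma(\mathbf{x})) - \varsigma(\mathbf{x})\varsigma(\mathbf{x})^T$ can be checked numerically against a finite-difference approximation. A minimal NumPy sketch (the function names are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; the result is unchanged
    # because softmax is invariant to adding a constant to every component.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    # J = diag(s) - s s^T, from the derivation above.
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

# Compare against a central finite-difference approximation of the Jacobian.
x = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(x)

eps = 1e-6
J_num = np.zeros((x.size, x.size))
for j in range(x.size):
    d = np.zeros(x.size)
    d[j] = eps
    J_num[:, j] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-6))  # True
```

The Jacobian is symmetric, and each of its rows sums to zero, which follows from the outputs of softmax always summing to 1.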