# RankNet

RankNet [Burges et al., 2005] is a pairwise loss function, a popular choice for training neural LTR models, and an industry favourite [Burges, 2015].

Predicted probabilities:

$$
P_{ij} = P(s_i > s_j) \equiv \frac{e^{\gamma s_i}}{e^{\gamma s_i} + e^{\gamma s_j}} = \frac{1}{1 + e^{-\gamma(s_i - s_j)}}
\qquad \text{and} \qquad
P_{ji} \equiv \frac{1}{1 + e^{-\gamma(s_j - s_i)}}
$$

Desired probabilities: $\bar{P}_{ij} = 1$ and $\bar{P}_{ji} = 0$.

Computing the cross-entropy between $\bar{P}$ and $P$:

$$
\begin{aligned}
\mathcal{L}_{\text{RankNet}} &= -\bar{P}_{ij} \log\left(P_{ij}\right) - \bar{P}_{ji} \log\left(P_{ji}\right) \\
&= -\log\left(P_{ij}\right) \\
&= \log\left(1 + e^{-\gamma(s_i - s_j)}\right)
\end{aligned}
$$

More generally, let $S_{ij} \in \{-1, 0, 1\}$ indicate the preference between $d_i$ and $d_j$: $1$ if $d_i$ is preferred, $-1$ if $d_j$ is preferred, and $0$ if both are equally relevant. The desired probability for a pair is then:

$$
\bar{P}\left(d_i \succ d_j\right) = \frac{1}{2}\left(1 + S_{ij}\right)
$$

The predicted probability is:

$$
P\left(d_i \succ d_j\right) = \frac{1}{1 + e^{-\gamma(s_i - s_j)}}
$$

Substituting $\bar{P}_{ij} = \frac{1}{2}\left(1 + S_{ij}\right)$ and $1 - \bar{P}_{ij} = \frac{1}{2}\left(1 - S_{ij}\right)$ into the cross-entropy, the loss becomes:

$$
\mathcal{L}_{ij} = \frac{1}{2}\left(1 - S_{ij}\right)\gamma\left(s_i - s_j\right) + \log\left(1 + e^{-\gamma(s_i - s_j)}\right)
$$
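As a sanity check on the derivation above, here is a minimal NumPy sketch of the pairwise loss (the function name and example scores are illustrative, not from the original paper):

```python
import numpy as np

def ranknet_loss(s_i, s_j, S_ij, gamma=1.0):
    """Pairwise RankNet cross-entropy loss for model scores s_i, s_j.

    S_ij is 1 if d_i is preferred, -1 if d_j is preferred, 0 if tied.
    """
    diff = gamma * (s_i - s_j)
    # logaddexp(0, -diff) computes log(1 + exp(-diff)) without overflow
    return 0.5 * (1.0 - S_ij) * diff + np.logaddexp(0.0, -diff)

# With S_ij = 1 the loss reduces to -log(P_ij), as in the first derivation
s_i, s_j = 2.0, 0.5
P_ij = 1.0 / (1.0 + np.exp(-(s_i - s_j)))
print(np.isclose(ranknet_loss(s_i, s_j, S_ij=1), -np.log(P_ij)))
```

By symmetry, swapping the pair and flipping the label gives the same loss: `ranknet_loss(s_i, s_j, -1) == ranknet_loss(s_j, s_i, 1)`.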
The derivative w.r.t. $s_i$:

$$
\frac{\delta \mathcal{L}_{ij}}{\delta s_i} = \gamma\left(\frac{1}{2}\left(1 - S_{ij}\right) - \frac{1}{1 + e^{\gamma(s_i - s_j)}}\right) = -\frac{\delta \mathcal{L}_{ij}}{\delta s_j}
$$

By the chain rule, the gradient w.r.t. the model weights $w$ factorizes as:

$$
\frac{\delta \mathcal{L}_{ij}}{\delta w} = \frac{\delta \mathcal{L}_{ij}}{\delta s_i} \frac{\delta s_i}{\delta w} + \frac{\delta \mathcal{L}_{ij}}{\delta s_j} \frac{\delta s_j}{\delta w} = \gamma\left(\frac{1}{2}\left(1 - S_{ij}\right) - \frac{1}{1 + e^{\gamma(s_i - s_j)}}\right)\left(\frac{\delta s_i}{\delta w} - \frac{\delta s_j}{\delta w}\right)
$$

We choose $\lambda$ so that:

$$
\frac{\delta \mathcal{L}_{ij}}{\delta w} = \lambda_{ij}\left(\frac{\delta s_i}{\delta w} - \frac{\delta s_j}{\delta w}\right)
$$

where:

$$
\lambda_{ij} = \gamma\left(\frac{1}{2}\left(1 - S_{ij}\right) - \frac{1}{1 + e^{\gamma(s_i - s_j)}}\right)
$$

These lambdas act like forces pushing pairs of documents apart or together. The same can be done at the document level, by summing over all pairs involving $d_i$:

$$
\lambda_i = \sum_j \lambda_{ij}
$$

Issues with RankNet:

- RankNet is based on virtual probabilities: $P\left(d_i \succ d_j\right)$.
- In reality, the ranking model does not follow these probabilities.
- Not elegant, but not a big deal.

---

## References