# Learning to Rank

> "... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance." - Liu [2009]

**Representation:** Represent the document and query in a format that an ML model can use: a numerical vector $\vec{x} \in \mathbb{R}^{n}$.

**Prediction:**

- A ranking model $f: \vec{x} \rightarrow \mathbb{R}$ is optimized to score each document-query combination so that relevant documents are scored higher.
- In mathematical terms: $f$ maps a feature vector to a real-valued score.

Models can be trained on different kinds of data:

Offline or Supervised LTR: learn from annotated data.
- Expensive and time-consuming.
- Provides ground-truth labels.

Online/Counterfactual LTR: learn from user interactions.
- Virtually free and easy to obtain.
- Hard to interpret.

Thus we have:

- Feature representations of document-query pairs: $\vec{x}_{q, d} \in \mathbb{R}^{n}$.
- Labels indicating the relevance of document-query pairs: $y_{q, d} \in \{0, 1, \ldots, 4\}$.

And we want:

- A function $f: \vec{x} \rightarrow \mathbb{R}$ that scores documents.
- To get the best ranking by sorting according to $f(\vec{x})$.

How do we find $f$?

- Pointwise Approach
    - Predict the relevance per item; simple but very naive.
    - Ignores that the ordering of items is what matters.
- Pairwise Approach
    - Loss based on document pairs; minimizes the number of incorrect inversions.
    - Ignores that not all document pairs have the same impact.
    - Often used in industry.
- Listwise Approach
    - Tries to optimize IR metrics directly, but they are not differentiable.
    - Approximations by heuristics, bounding, or probabilistic approaches to ranking.
    - Best approach of the three.

## The Pointwise Approach

### Regression loss

Given $\langle q, d\rangle$, predict the value of $y_{q, d}$, e.g., squared loss for binary or categorical labels,

$$
\mathcal{L}_{\text{Squared}}(q, d, y_{q, d}) = \left\| y_{q, d} - f(\vec{x}_{q, d}) \right\|^{2}
$$

where $y_{q, d}$ is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label.

### Classification loss

Given $\langle q, d\rangle$, predict the class $y_{q, d}$, e.g., cross-entropy with softmax over categorical labels $Y$ [Li et al., 2008],

$$
\mathcal{L}_{\mathrm{CE}}(q, d, y_{q, d}) = -\log\left(p(y_{q, d} \mid q, d)\right) = -\log\left(\frac{e^{\gamma \cdot s_{y_{q, d}}}}{\sum_{y \in Y} e^{\gamma \cdot s_{y}}}\right)
$$

where $s_{y_{q, d}}$ is the model's score for label $y_{q, d}$.

### Issues with Pointwise LTR

- Class imbalance:
    - many irrelevant documents and very few relevant documents.
- Query-level feature normalization is needed:
    - the distribution of features differs greatly per query.

These can be overcome. But what is fundamentally wrong with these methods?

**Ranking is not a regression or classification problem.**

- A document-level loss does not work for ranking problems because document scores should not be considered independently.
- In other words, pointwise methods **do not directly optimize ranking quality**. A lower loss does not mean better ranking!
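To make the two pointwise losses concrete, here is a minimal NumPy sketch (not from the course material): the function names and the toy data are made up for illustration, and the cross-entropy variant assumes one score per relevance grade, as in the formula above.

```python
import numpy as np

def pointwise_squared_loss(scores, labels):
    """Squared loss between the predicted score f(x) and the relevance label, per document."""
    return (labels - scores) ** 2

def pointwise_ce_loss(grade_scores, labels, gamma=1.0):
    """Cross-entropy with a softmax over relevance grades (one score s_y per grade y)."""
    logits = gamma * grade_scores                        # shape: (n_docs, n_grades)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]    # negative log-probability of the true grade

# Toy example: 4 documents with relevance grades in {0, ..., 4}.
rng = np.random.default_rng(0)
labels = np.array([3, 0, 1, 4])
scores = rng.normal(size=4)             # f(x) for the regression formulation
grade_scores = rng.normal(size=(4, 5))  # s_y for each of the 5 grades

print(pointwise_squared_loss(scores, labels).mean())
print(pointwise_ce_loss(grade_scores, labels).mean())
```

Note that both losses treat every document in isolation, which is exactly the limitation pointed out above: a lower pointwise loss does not guarantee a better ranking.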
## The Pairwise Approach

Instead of looking at the document level, consider pairs of documents:

$$
y_{q, d_{i}} > y_{q, d_{j}} \rightarrow d_{i} \succ d_{j}
$$

### Naive Pairwise Model

$$
P(d_{i} \succ d_{j}) = f(\vec{x}_{i}, \vec{x}_{j})
$$

This method has quadratic complexity, $O(N^{2})$, at inference time, and the pair preferences have to be aggregated somehow. This can lead to paradoxical situations:

$$
\begin{array}{l}
d_{1} \succ d_{2} \\
d_{2} \succ d_{3} \\
d_{3} \succ d_{1}
\end{array}
$$

### Pairwise Loss Functions

The scoring model remains unchanged:

$$
f(\vec{x}_{i}) = s_{i}
$$

But the loss function is based on document pairs:

$$
\mathcal{L}_{\text{pairwise}} = \sum_{d_{i} \succ d_{j}} \phi(s_{i} - s_{j})
$$

Thus we still score documents individually and then order them according to their scores. The pairwise loss minimizes the average number of inversions in the ranking:

- $d_{i} \succ_{q} d_{j}$ but $d_{j}$ is ranked higher than $d_{i}$.

The pairwise loss generally has the following form [Chen et al., 2009],

$$
\mathcal{L}_{\text{pairwise}} = \phi(s_{i} - s_{j})
$$

where $\phi$ can be,

- Hinge function [Herbrich et al., 2000]: $\phi(z) = \max(0, 1 - z)$
- Exponential function [Freund et al., 2003]: $\phi(z) = e^{-z}$
- Logistic function [Burges et al., 2005]: $\phi(z) = \log\left(1 + e^{-z}\right)$
- etc.

State-of-the-art pairwise LTR model: [[RankNet]]

### Issues with Pairwise LTR

What is the fundamental problem with Pairwise LTR?

![[PariwiseProblem.png]]

- The scoring model scores documents independently: $f(\vec{x}_{d_{i}}) = s_{i}$. The loss is based on document pairs and minimizes the number of incorrect inversions:

$$
\mathcal{L}_{\text{pairwise}} = \sum_{d_{i} \succ d_{j}} \phi(s_{i} - s_{j})
$$

- However, not every document pair is equally important.
- It is much more important to get the correct ordering of the top documents than of the bottom documents.
- For instance, the order of the top-5 is much more important than the order of documents after position 10.

## The Listwise Approach

The **fundamental problem** with the approaches so far is that they did not optimize **ranking quality** directly. An LTR method should directly optimize the ranking metric we care about.

Ranking metrics can range from simple:

$$
\operatorname{precision}(R) = \frac{1}{|R|} \sum_{i} \operatorname{relevance}(R_{i})
$$

to much more complex, e.g. [[Discounted Cumulative Gain]]:

$$
DCG(R) = \sum_{i} \frac{2^{\operatorname{relevance}(R_{i})} - 1}{\log_{2}(i + 1)}
$$

How do we optimize for these metrics?

- They are non-continuous and non-differentiable, so we cannot take gradients.

State-of-the-art listwise LTR models:

- [[LambdaRank]]
    - Still used as a state-of-the-art ranking method.
- [[ListNet and ListMLE]]
    - Create a probabilistic model over rankings, which is differentiable.

Some remarks from the lecturer: the distinction between pairwise and listwise is not helpful:

- The derivatives of listwise losses always reduce to weighted pairwise derivatives (LambdaRank makes this explicit, ListNet does it implicitly).
- LambdaRank with the Average Relevant Rank metric is equivalent to the pairwise RankNet loss.
- In the field there are often categorization mistakes, e.g. calling LambdaRank a pairwise method.

---

## References

1. UvA IR course 2021

---

For counterfactual LTR (learning from logged user interactions), the same losses can be reweighted with inverse propensity scoring (IPS) weights $\Delta_{IPS}$:

$$
\mathcal{L}_{\text{Squared}}(q, d, y_{q, d}) = \Delta_{IPS}(f, d) \left\| y_{q, d} - f(\vec{x}_{q, d}) \right\|^{2}
$$

$$
\mathcal{L}_{\text{pairwise}} = \sum_{d_{i} \succ d_{j}} \left( \Delta_{IPS}(f, d_{i}) + \Delta_{IPS}(f, d_{j}) \right) \phi\left( f(\vec{x}_{d_{i}}) - f(\vec{x}_{d_{j}}) \right)
$$

where $\phi$ can again be the hinge, exponential, or logistic function listed above.
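As a rough sketch of the reweighted pairwise loss above (not from the course material), the snippet below takes the per-document IPS weights as given inputs (how $\Delta_{IPS}$ is estimated is not covered in these notes) and uses the logistic $\phi$:

```python
import numpy as np

def ips_weighted_pairwise_loss(scores, labels, ips_weights):
    """Pairwise logistic loss where each pair (i, j) with y_i > y_j is
    weighted by Delta_IPS(f, d_i) + Delta_IPS(f, d_j)."""
    loss = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:                     # d_i should be ranked above d_j
                weight = ips_weights[i] + ips_weights[j]  # IPS correction for this pair
                loss += weight * np.log1p(np.exp(-(scores[i] - scores[j])))  # logistic phi
    return loss

# Toy example: 4 documents with model scores, graded labels, and given IPS weights.
scores = np.array([0.4, 1.2, -0.3, 0.1])
labels = np.array([2, 0, 1, 3])
ips_weights = np.array([1.5, 1.0, 2.0, 1.2])
print(ips_weighted_pairwise_loss(scores, labels, ips_weights))
```

Setting all weights to 1 recovers the plain pairwise logistic loss from the pairwise section.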