# Learning to Rank
"... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance." - Liu [2009]
**Representation:**
Represent the document and query in a format that an ML model can use:
a numerical vector $\vec{x} \in \mathbb{R}^{n}$
**Prediction:**
- Then a ranking model $f: \vec{x} \rightarrow \mathbb{R}$ is optimized to score each document-query
combination so that relevant documents are scored higher.
- In mathematical terms: $f$ maps a feature vector to a real-valued score.
Models can be trained on different kinds of data:
**Offline or supervised LTR**: learn from annotated data.
- Expensive and time-consuming to collect.
- Provides ground-truth labels.
**Online/counterfactual LTR**: learn from user interactions.
- Virtually free and easy to obtain.
- Hard to interpret.
Thus we have:
- Feature representation of document-query pairs: $\vec{x}_{q, d} \in \mathbb{R}^{n}$.
- Labels indicating the relevance of document-query pairs, e.g. graded relevance $y_{q, d} \in\{0,1,2,3,4\}$.
And we want:
- A function $f: \vec{x} \rightarrow \mathbb{R}$ that scores documents.
- To get the best ranking by sorting according to $f(\vec{x})$.
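A minimal sketch of this setup, with a hypothetical linear scorer and made-up feature vectors (everything here is illustrative, not the course's actual features):

```python
import numpy as np

# Toy feature vectors x_{q,d} for three candidate documents of one query (n = 3 features).
X = np.array([
    [0.2, 1.0, 0.5],   # d1
    [0.9, 0.1, 0.3],   # d2
    [0.4, 0.8, 0.7],   # d3
])

# A hypothetical linear scoring function f(x) = w . x; in LTR the weights are learned.
w = np.array([0.6, 0.3, 0.1])
scores = X @ w

# The ranking is obtained by sorting documents by descending score.
ranking = np.argsort(-scores)
print(scores)    # [0.47 0.6  0.55]
print(ranking)   # [1 2 0] -> d2, d3, d1
```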
How do we find $f$?
- Pointwise Approach
- Predict the relevance of each item individually; simple but very naive.
- Ignores that the ordering of items is what matters.
- Pairwise Approach
- Loss is defined on document pairs; minimizes the number of incorrect inversions.
- Ignores that not all document pairs have the same impact.
- Often used in industry.
- Listwise Approach
- Tries to optimize for IR metrics, but they are not differentiable.
- Approximated via heuristics, bounds, or probabilistic models over rankings.
- Best approach out of the three.
## The Pointwise Approach
### Regression loss
Given $\langle q, d\rangle$, predict the value of $y_{q, d}$,
e.g., squared loss for binary or categorical labels,
$
\mathcal{L}_{\text {Squared }}\left(q, d, y_{q, d}\right)=\left\|y_{q, d}-f\left(\vec{x}_{q, d}\right)\right\|^{2}
$
where, $y_{q, d}$ is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label.
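A minimal sketch of this regression loss on toy data (the labels and scores are made up for illustration):

```python
import numpy as np

def squared_loss(y, score):
    """Pointwise squared loss ||y - f(x)||^2 for a single document-query pair."""
    return (y - score) ** 2

# Toy graded relevance labels and model scores f(x_{q,d}) for three documents.
labels = np.array([4.0, 0.0, 2.0])
scores = np.array([3.2, 0.5, 2.4])

# The training objective is simply the mean (or sum) of the per-document losses.
print(squared_loss(labels, scores).mean())
```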
### Classification loss
Given $\langle q, d\rangle$, predict the class $y_{q, d}$,
e.g., cross-entropy with softmax over categorical labels $Y$ [Li et al., 2008],
$
\mathcal{L}_{\mathrm{CE}}\left(q, d, y_{q, d}\right)=-\log \left(p\left(y_{q, d} \mid q, d\right)\right)=-\log \left(\frac{e^{\gamma \cdot s_{y_{q, d}}}}{\sum_{y \in Y} e^{\gamma \cdot s_{y}}}\right)
$
where, $s_{y_{q, d}}$ is the model's score for label $y_{q, d}$.
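A sketch of the classification variant, assuming five relevance grades and made-up per-class scores $s_y$:

```python
import numpy as np

def cross_entropy_loss(class_scores, label, gamma=1.0):
    """-log p(y_{q,d} | q, d) with a softmax over the per-class scores s_y."""
    logits = gamma * np.asarray(class_scores, dtype=float)
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return -log_probs[label]

# Toy scores for the relevance classes Y = {0, 1, 2, 3, 4} of one document-query pair.
class_scores = [0.1, 0.3, 0.2, 1.5, 2.0]
print(cross_entropy_loss(class_scores, label=4))  # class 4 has the highest score, so the loss is relatively low
```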
### Issues with Pointwise LTR
- Class Imbalance:
- many irrelevant documents and very few relevant documents.
- Query-level feature normalization is needed:
- the distribution of features differs greatly per query.
These can be overcome. But what is fundamentally wrong with these methods?
**Ranking is not a regression or classification problem.**
- A document-level loss does not work for ranking problems because document scores should not be considered independently.
- In other words, pointwise methods **do not directly optimize ranking quality**. A lower loss does not mean better ranking!
## The Pairwise Approach
Instead of looking at documents individually, consider pairs of documents.
$
y_{q, d_{i}}>y_{q, d_{j}} \rightarrow d_{i} \succ d_{j}
$
### Naive Pairwise Model
$
P\left(d_{i} \succ d_{j}\right)=f\left(\vec{x}_{i}, \vec{x}_{j}\right)
$
This method is quadratic in the number of documents: $O\left(N^{2}\right)$ comparisons at inference time.
The pairwise preferences then have to be aggregated into a single ranking somehow. This can lead to paradoxical (cyclic) situations:
$
\begin{array}{l}
d_{1} \succ d_{2} \\
d_{2} \succ d_{3} \\
d_{3} \succ d_{1}
\end{array}
$
### Pairwise Loss Functions
The scoring model remains unchanged:
$
f\left(\vec{x}_{i}\right)=s_{i}
$
But the loss function is based on document pairs:
$
\mathcal{L}_{\text {pairwise }}=\sum_{d_{i} \succ d_{j}} \phi\left(s_{i}-s_{j}\right)
$
Thus we still score documents and then order according to scores.
Pairwise loss minimizes the average number of inversions in ranking:
- $d_{i} \succ_{q} d_{j}$ but $d_{j}$ is ranked higher than $d_{i}$
For a single document pair, the pairwise loss generally has the following form [Chen et al., 2009],
$
\mathcal{L}_{\text {pairwise }}=\phi\left(s_{i}-s_{j}\right)
$
where, $\phi$ can be,
- Hinge function [Herbrich et al., 2000]: $\quad \phi(z)=\max (0,1-z)$
- Exponential function [Freund et al., 2003]: $\quad \phi(z)=e^{-z}$
- Logistic function [Burges et al., 2005]: $\phi(z)=\log \left(1+e^{-z}\right)$
- etc.
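A minimal sketch of this pairwise loss over one query's documents (toy data; `phi` can be swapped for any of the functions listed above):

```python
import numpy as np

def pairwise_loss(scores, labels, phi):
    """Sum phi(s_i - s_j) over all document pairs where d_i should be ranked above d_j."""
    total = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:          # d_i > d_j in the ground truth
                total += phi(scores[i] - scores[j])
    return total

hinge    = lambda z: max(0.0, 1.0 - z)          # Herbrich et al.
logistic = lambda z: np.log(1.0 + np.exp(-z))   # Burges et al. (RankNet)

# Toy scores s_i = f(x_i) and graded labels for three documents of one query.
scores = np.array([1.2, 0.4, 0.9])
labels = np.array([2, 0, 1])
print(pairwise_loss(scores, labels, hinge), pairwise_loss(scores, labels, logistic))
```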
State-of-the-art pairwise LTR model: [[RankNet]].
### Issues with Pairwise LTR
What is the fundamental problem with Pairwise LTR?
![[PariwiseProblem.png]]
- The scoring model scores documents independently: $f\left(\vec{x}_{d_{i}}\right)=s_{i}$. The loss is based on document pairs, and minimizes the number of incorrect inversions:
$
\mathcal{L}_{\text {pairwise }}=\sum_{d_{i} \succ d_{j}} \phi\left(s_{i}-s_{j}\right)
$
- However, not every document pair is equally important.
- It is much more important to get the correct ordering of top documents than of the bottom documents.
- For instance, the order of the top-5 is much more important than the order of documents after position 10.
## The Listwise Approach
The **fundamental problem** with the approaches so far is that they did not optimize **ranking quality** directly.
A LTR method should directly optimize the ranking metric we care about.
Ranking metrics can range from simple:
$
\operatorname{precision}(R)=\frac{1}{|R|} \sum_{R_{i}} \operatorname{relevance}\left(R_{i}\right)
$
to much more complex, e.g. [[Discounted Cumulative Gain]]:
$
DCG(R)=\sum_{R_{i}} \frac{2^{\text {relevance }\left(R_{i}\right)}-1}{\log _{2}(i+1)}
$
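A small sketch of these two metrics, using the $2^{\text{relevance}}-1$ gain and $\log_2$ discount from the formula above (toy relevance lists, illustrative function names):

```python
import numpy as np

def precision(relevance):
    """Fraction of the ranked documents that are relevant (binary relevance labels)."""
    return float(np.mean(relevance))

def dcg(relevance):
    """DCG over a ranking: gain 2^rel - 1, discounted by log2(rank + 1) with 1-based ranks."""
    relevance = np.asarray(relevance, dtype=float)
    ranks = np.arange(1, len(relevance) + 1)
    return float(np.sum((2.0 ** relevance - 1.0) / np.log2(ranks + 1)))

# Relevance of the documents in ranked order for one query.
print(precision([1, 0, 1, 1]))   # 0.75
print(dcg([3, 2, 0, 1]))         # 7/1 + 3/log2(3) + 0 + 1/log2(5) ≈ 9.32
```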
How do we optimize for these metrics?
- These metrics are discontinuous and non-differentiable, so we cannot take gradients directly.
State-of-the-art listwise LTR models:
- [[LambdaRank]]
- Still used as a state-of-the-art ranking method.
- [[ListNet and ListMLE]]
- Create a probabilistic model for ranking, which is differentiable.
Some remarks from the lecturer on why the distinction between pairwise and listwise is not that helpful:
- The derivatives of listwise losses always reduce to weighted pairwise derivatives (LambdaRank makes this explicit, ListNet does it implicitly); see the sketch below.
- LambdaRank with the average relevant rank metric is equivalent to the pairwise RankNet loss.
- In the field there are often categorization mistakes, e.g. calling LambdaRank a pairwise method.
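To make the first remark concrete, here is a rough sketch (not the exact LambdaRank formulation) of a listwise objective showing up as a weighted pairwise gradient: a RankNet-style pairwise gradient scaled by the $|\Delta DCG|$ of swapping the two documents. Normalization by the ideal DCG is omitted for brevity.

```python
import numpy as np

def dcg_gain(rel, rank):
    """DCG contribution of a document with relevance `rel` at 1-based position `rank`."""
    return (2.0 ** rel - 1.0) / np.log2(rank + 1)

def lambda_ij(scores, labels, ranks, i, j, sigma=1.0):
    """RankNet-style gradient for the pair (d_i, d_j), weighted by |delta DCG| of swapping them."""
    delta_dcg = abs(dcg_gain(labels[i], ranks[i]) + dcg_gain(labels[j], ranks[j])
                    - dcg_gain(labels[i], ranks[j]) - dcg_gain(labels[j], ranks[i]))
    ranknet_grad = -sigma / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
    return delta_dcg * ranknet_grad

# Toy scores, labels, and the 1-based rank each document currently occupies.
scores = np.array([2.0, 1.0, 0.5])
labels = np.array([0, 2, 1])
ranks = np.argsort(np.argsort(-scores)) + 1
# d_1 is more relevant than d_0; the gradient w.r.t. s_1 is negative, so gradient descent pushes s_1 up.
print(lambda_ij(scores, labels, ranks, i=1, j=0))
```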
---
### Counterfactual LTR: IPS-weighted losses
The supervised losses above reappear in counterfactual LTR with additional per-document weights $\Delta_{IPS}(f, d)$ (inverse-propensity-scoring corrections) when learning from user interactions:
$
\mathcal{L}_{\text {Squared }}\left(q, d, y_{q, d}\right)=\Delta_{IPS}(f, d)\left\|y_{q, d}-f\left(\vec{x}_{q, d}\right)\right\|^{2}
$
$
\mathcal{L}_{\text {pairwise }}=\sum_{d_{i} \succ d_{j}}\left(\Delta_{IPS}\left(f, d_{i}\right)+\Delta_{IPS}\left(f, d_{j}\right)\right) \phi\left(f\left(\vec{x}_{d_{i}}\right)-f\left(\vec{x}_{d_{j}}\right)\right)
$
where $\phi$ can again be the hinge, exponential, or logistic function listed above.

## References
1. UvA IR course 2021