# Counterfactual Evaluation and LTR

Counterfactual evaluation: evaluate a new ranking function $f_{\theta}$ using historical interaction data (e.g., clicks) collected from a previously deployed ranking function $f_{\text{deploy}}$.

The main thing to keep in mind when developing counterfactual evaluation and LTR: user interactions with rankings are heavily biased and noisy, because clicks are noisy indicators of relevance and top-ranked documents are more likely to be examined (position bias).

Counterfactual Learning to Rank:

- learns without bias from historical interaction logs,
- corrects for position bias with inverse propensity scoring,
- requires an explicit user model.

Users' examination probabilities are estimated through randomization or joint learning.

If we know the true relevance labels ($y(d_i)$ for all $i$), we can compute any additive, linearly decomposable IR metric as:

$$
\Delta(f_{\theta}, D, y) = \sum_{d_i \in D} \lambda\big(\operatorname{rank}(d_i \mid f_{\theta}, D)\big) \cdot y(d_i),
$$

where $\lambda$ is a rank weighting function, e.g.,

- Average Relevance Position (ARP): $\lambda(r) = r$,
- Discounted Cumulative Gain (DCG): $\lambda(r) = \frac{1}{\log_2(1+r)}$,
- Precision at $k$: $\lambda(r) = \frac{\mathbf{1}[r \leq k]}{k}$.

We often do not know the true relevance labels $y(d_i)$ and can only observe implicit feedback in the form of, e.g., clicks:

- A click $c_i$ on document $d_i$ is a biased and noisy indicator that $d_i$ is relevant.
- A missing click does not necessarily indicate non-relevance:
    - Relevance: the document may not be relevant.
    - Observance: the user may not have examined the document.
    - Miscellaneous: various random reasons why a user may not click.

## Examination User Model

If we only consider *examination* and *relevance*, a user click can be modelled by:

- the probability of document $d_i$ being examined ($o_i = 1$) in a ranking $R$:

$$
P(o_i = 1 \mid R, d_i),
$$

- the probability of a click ($c_i = 1$) on $d_i$ given its relevance $y(d_i)$ and whether it was examined ($o_i$):

$$
P(c_i = 1 \mid o_i, y(d_i)).
$$

Clicks only occur on examined documents, thus the probability of a click in ranking $R$ is:

$$
P(c_i = 1 \wedge o_i = 1 \mid y(d_i), R) = P(c_i = 1 \mid o_i = 1, y(d_i)) \cdot P(o_i = 1 \mid R, d_i).
$$

## Naive Estimator

A naive estimator assumes clicks are an unbiased relevance signal:

$$
\Delta_{NAIVE}(f_{\theta}, D, c) = \sum_{d_i \in D} \lambda\big(\operatorname{rank}(d_i \mid f_{\theta}, D)\big) \cdot c_i.
$$

- Even if no click noise is present, i.e., $P(c_i = 1 \mid o_i = 1, y(d_i)) = y(d_i)$, this estimator is biased by the examination probabilities.
- In expectation, it weights documents by their examination probabilities in the ranking $R$ displayed during logging:

$$
\mathbb{E}_{o}\big[\Delta_{NAIVE}(f_{\theta}, D, c)\big] = \sum_{d_i \in D} P(o_i = 1 \mid R, d_i) \cdot \lambda\big(\operatorname{rank}(d_i \mid f_{\theta}, D)\big) \cdot y(d_i).
$$

- In rankings, documents at higher ranks are more likely to be examined: *position bias*.
- Position bias causes logging-policy-confirming behavior: documents displayed at higher ranks during logging are incorrectly considered more relevant.
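To make the bias concrete, the following is a minimal simulation sketch (not from the original notes): it assumes a hypothetical examination probability of $1/\text{rank}$ and binary relevance, simulates clicks from the examination model with no click noise, and shows that the naive DCG estimate concentrates on the position-weighted expectation above instead of the true DCG.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: five documents with binary relevance y(d_i),
# logged in a fixed ranking R where P(o_i = 1 | R, d_i) = 1 / rank.
relevance = np.array([0, 1, 0, 1, 1])
examination = 1.0 / np.arange(1, 6)

def dcg_lambda(rank):
    """DCG rank weighting: lambda(r) = 1 / log2(1 + r)."""
    return 1.0 / np.log2(1.0 + rank)

# Ranks assigned by the new ranker f_theta; here it keeps the logged order.
new_ranks = np.arange(1, 6)

# Examination model without click noise: click <=> examined and relevant.
estimates = []
for _ in range(10_000):
    observed = rng.random(5) < examination
    clicks = (observed & (relevance == 1)).astype(float)
    estimates.append(np.sum(dcg_lambda(new_ranks) * clicks))  # Delta_NAIVE

true_dcg = np.sum(dcg_lambda(new_ranks) * relevance)
print(f"true DCG:       {true_dcg:.3f}")
print(f"E[Delta_NAIVE]: {np.mean(estimates):.3f}")  # shrunk by P(o_i = 1 | R, d_i)
```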
## Inverse Propensity Scoring

Counterfactual evaluation was proposed fairly recently (2016). It accounts for bias using *Inverse Propensity Scoring (IPS)*: the IPS estimator weights clicks inversely proportional to their examination probabilities:

$$
\Delta_{IPS}(f_{\theta}, D, c) = \sum_{d_i \in D} \frac{\lambda\big(\operatorname{rank}(d_i \mid f_{\theta}, D)\big)}{P(o_i = 1 \mid R, d_i)} \cdot c_i,
$$

where

- $\lambda(\operatorname{rank}(d_i \mid f_{\theta}, D))$: (weighted) rank of document $d_i$ by ranker $f_{\theta}$,
- $c_i$: observed click on the document in the log,
- $P(o_i = 1 \mid R, d_i)$: examination probability of $d_i$ in the ranking $R$ displayed during logging.

This is an unbiased estimate of any additive, linearly decomposable IR metric (if no click noise is present). The IPS approach still works without this assumption, as long as click probabilities are order-preserving in relevance:

$$
y(d_i) > y(d_j) \Leftrightarrow P(c_i = 1 \mid o_i = 1, y(d_i)) > P(c_j = 1 \mid o_j = 1, y(d_j)),
$$

since then relative differences are still inferred without bias:

$$
\mathbb{E}_{o,c}\big[\Delta_{IPS}(f_{\theta}, D, c)\big] > \mathbb{E}_{o,c}\big[\Delta_{IPS}(f_{\theta'}, D, c)\big] \Leftrightarrow \Delta(f_{\theta}, D) > \Delta(f_{\theta'}, D).
$$

## Propensity-weighted LTR

$$
\Delta_{IPS}(f_{\theta}, D, c) = \sum_{d_i \in D} \frac{\lambda\big(\operatorname{rank}(d_i \mid f_{\theta}, D)\big)}{P(o_i = 1 \mid R, d_i)} \cdot c_i
$$

How do we optimize for this unbiased performance estimate?

- It is not differentiable; this is a common problem for all ranking metrics. We have to find a differentiable counterpart.
- Rank-SVM (Joachims, 2002) optimizes the following *differentiable upper bound*:

$$
\begin{aligned}
\operatorname{rank}(d \mid f_{\theta}, D) &= \sum_{d' \in D} \mathbb{1}\big[f_{\theta}(d) \leq f_{\theta}(d')\big] \\
&\leq \sum_{d' \in D} \max\big(1 - (f_{\theta}(d) - f_{\theta}(d')), 0\big) = \overline{\operatorname{rank}}(d \mid f_{\theta}, D).
\end{aligned}
$$

- Alternative choices are possible, e.g., a sigmoid-like bound (with parameter $\sigma$):

$$
\operatorname{rank}(d \mid f_{\theta}, D) \leq \sum_{d' \in D} \log_2\big(1 + e^{-\sigma(f_{\theta}(d) - f_{\theta}(d'))}\big),
$$

  which is commonly used for pairwise learning, LambdaMART (Burges, 2010), and LambdaLoss (Wang et al., 2018b).

### Average Relevance Position

For the Average Relevance Position metric:

$$
\Delta_{ARP}(f_{\theta}, D, y) = \sum_{d_i \in D} \operatorname{rank}(d_i \mid f_{\theta}, D) \cdot y(d_i).
$$

This gives us an *unbiased estimator* and *upper bound*:

$$
\begin{aligned}
\Delta_{ARP\text{-}IPS}(f_{\theta}, D, c) &= \sum_{d_i \in D} \frac{\operatorname{rank}(d_i \mid f_{\theta}, D)}{P(o_i = 1 \mid R, d_i)} \cdot c_i \\
&\leq \sum_{d_i \in D} \frac{\overline{\operatorname{rank}}(d_i \mid f_{\theta}, D)}{P(o_i = 1 \mid R, d_i)} \cdot c_i.
\end{aligned}
$$

This upper bound is *differentiable* and *optimizable* by stochastic gradient descent or quadratic programming, i.e., Rank-SVM (Joachims, 2006).
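As an illustration, here is a sketch (my own, not from the source) of optimizing the propensity-weighted ARP bound for a linear scorer $f_{\theta}(d) = \theta^{\top} x_d$. The feature matrix, clicks, and propensities are made-up toy values, and a central finite-difference gradient stands in for the analytic subgradient a real implementation (e.g., with autograd) would use.

```python
import numpy as np

# Made-up logged data for one query: document features, observed clicks c_i,
# and known examination propensities P(o_i = 1 | R, d_i).
X = np.array([[0.2, 1.0], [0.9, 0.3], [0.5, 0.5], [0.1, 0.8]])
clicks = np.array([1.0, 0.0, 1.0, 0.0])
propensity = np.array([0.9, 0.6, 0.4, 0.25])

def hinge_rank_bound(scores):
    """rank-bar(d_i): pairwise hinge upper bound on the rank (up to a constant)."""
    diff = scores[None, :] - scores[:, None]       # diff[i, j] = f(d_j) - f(d_i)
    hinge = np.maximum(1.0 + diff, 0.0)
    np.fill_diagonal(hinge, 0.0)                   # drop the constant d' = d term
    return hinge.sum(axis=1)

def arp_ips_bound(theta):
    """Upper bound on Delta_ARP-IPS: only clicked documents contribute."""
    scores = X @ theta
    return np.sum(hinge_rank_bound(scores) / propensity * clicks)

# Gradient descent on the bound via central finite differences.
theta = np.zeros(2)
steps = [np.array([1e-4, 0.0]), np.array([0.0, 1e-4])]
for _ in range(300):
    grad = np.array([(arp_ips_bound(theta + e) - arp_ips_bound(theta - e)) / 2e-4
                     for e in steps])
    theta -= 0.05 * grad

print("learned weights:", np.round(theta, 3))
print("bound value:    ", round(arp_ips_bound(theta), 3))
```

Minimizing the bound pushes clicked documents above unclicked ones, with clicks from rarely examined positions weighted up by their small propensities.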
## Additive Metrics

A similar approach can be applied to other additive metrics (Agarwal et al., 2019a). If $\lambda$ is a monotonically decreasing function:

$$
x \leq y \Rightarrow \lambda(x) \geq \lambda(y),
$$

then:

$$
\operatorname{rank}(d \mid \cdot) \leq \overline{\operatorname{rank}}(d \mid \cdot) \Rightarrow \lambda(\operatorname{rank}(d \mid \cdot)) \geq \lambda(\overline{\operatorname{rank}}(d \mid \cdot)).
$$

This provides a lower bound, for instance for [[Discounted Cumulative Gain]] (DCG):

$$
\frac{1}{\log_2(1 + \operatorname{rank}(d \mid \cdot))} \geq \frac{1}{\log_2(1 + \overline{\operatorname{rank}}(d \mid \cdot))}.
$$

Then for the DCG metric:

$$
\Delta_{DCG}(f_{\theta}, D, y) = \sum_{d_i \in D} \log_2\big(1 + \operatorname{rank}(d_i \mid f_{\theta}, D)\big)^{-1} \cdot y(d_i).
$$

This gives us an *unbiased estimator* and *lower bound*:

$$
\begin{aligned}
\Delta_{DCG\text{-}IPS}(f_{\theta}, D, c) &= \sum_{d_i \in D} \frac{\log_2\big(1 + \operatorname{rank}(d_i \mid f_{\theta}, D)\big)^{-1}}{P(o_i = 1 \mid R, d_i)} \cdot c_i \\
&\geq \sum_{d_i \in D} \frac{\log_2\big(1 + \overline{\operatorname{rank}}(d_i \mid f_{\theta}, D)\big)^{-1}}{P(o_i = 1 \mid R, d_i)} \cdot c_i.
\end{aligned}
$$

This lower bound is *differentiable* and *optimizable* by stochastic gradient descent or the Convex-Concave Procedure (Agarwal et al., 2019a).

## Algorithm

- Obtain a model of position bias.
- Acquire a large click log.
- Then for every click in the log:
    - Compute the propensity of the click: $P(o_i = 1 \mid R, d_i)$.
    - Calculate the gradient of the bound on the unbiased estimator: $\nabla_{\theta}\left[\frac{\lambda(\overline{\operatorname{rank}}(d_i \mid f_{\theta}, D))}{P(o_i = 1 \mid R, d_i)}\right]$.
    - Update the model $f_{\theta}$ by adding/subtracting the gradient.

## Estimating Position Bias

At the core of counterfactual LTR methods is the propensity score $P(o_i = 1 \mid R, d_i)$, which removes bias from user interactions. How can this propensity score be estimated for a specific kind of bias: position bias?

Idea: if the displayed documents are randomly shuffled, the expected click-through rate at each rank depends only on the rank, not on which document is shown there.

![[RandTop-n.png]]

RandTop-n algorithm (a simulation is sketched at the end of this section):

- Repeat:
    - Randomly shuffle the top $n$ items.
    - Record clicks.
- Aggregate clicks per rank.
- Normalize to obtain propensities $p_i \propto P(o_i = 1 \mid i)$.

Note: for learning, we only need propensities *proportional* to the true observation probabilities.

However, uniformly randomizing the top $n$ results may negatively impact users during data logging. Companies still use it, particularly when the existing system is not very good to begin with. There are various methods that minimize the impact on the user:

- RandPair: choose a pivot rank $k$ and only swap a random other document with the document at this pivot rank (Joachims et al., 2017).
- Interventional Sets: exploit inherent "randomness" in data coming from multiple rankers (e.g., A/B tests in production logs) (Agarwal et al., 2017).

### Intervention Harvesting

- Main idea: in real-world production systems, many (randomized) interventions take place due to A/B tests. Can we use these interventions instead?
- This approach is called *intervention harvesting* (Agarwal et al., 2017; Fang et al., 2019; Agarwal et al., 2019b).

![[Intervention Harvesting.png]]
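The promised RandTop-n sketch, as a toy simulation with made-up examination probabilities and relevances: because shuffling makes the expected relevance identical at every rank, the per-rank click-through rates come out proportional to the examination probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up ground truth: examination probability per rank (to be recovered)
# and binary relevance for a pool of n documents.
n = 5
true_exam = 1.0 / np.arange(1, n + 1)
relevance = np.array([1, 0, 1, 1, 0])

clicks_per_rank = np.zeros(n)
impressions = np.zeros(n)

for _ in range(20_000):
    order = rng.permutation(n)                  # randomly shuffle the top n
    for rank, doc in enumerate(order):
        impressions[rank] += 1
        if rng.random() < true_exam[rank] and relevance[doc] == 1:
            clicks_per_rank[rank] += 1          # record the click

# Aggregate clicks per rank and normalize: CTR(rank) = P(o | rank) * E[relevance],
# so the ratios give propensities p_i proportional to P(o_i = 1 | i).
ctr = clicks_per_rank / impressions
print("estimated p_i:", np.round(ctr / ctr[0], 3))
print("true ratios:  ", np.round(true_exam / true_exam[0], 3))
```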
## Jointly Learning and Estimating

Instead of treating propensity estimation and unbiased learning to rank as two separate tasks, recent work has explored jointly learning rankings and estimating propensities. Recall that the probability of a click can be decomposed as:

$$
\underbrace{P(c_i = 1 \wedge o_i = 1 \mid y(d_i), R)}_{\text{click probability}} = \underbrace{P(c_i = 1 \mid o_i = 1, y(d_i))}_{\text{relevance probability}} \cdot \underbrace{P(o_i = 1 \mid R, d_i)}_{\text{observation probability}}
$$

If the observation probability is known, we can find an unbiased estimate of relevance via IPS. It is possible to jointly learn and estimate by iterating two steps:

- Learn an optimal ranker given a correct propensity model:

$$
\underbrace{P(c_i = 1 \mid o_i = 1, y(d_i))}_{\text{relevance probability}} = \frac{P(c_i = 1 \wedge o_i = 1 \mid y(d_i), R)}{P(o_i = 1 \mid R, d_i)}.
$$

- Learn an optimal propensity model given a correct ranker:

$$
\underbrace{P(o_i = 1 \mid R, d_i)}_{\text{observation probability}} = \frac{P(c_i = 1 \wedge o_i = 1 \mid y(d_i), R)}{P(c_i = 1 \mid o_i = 1, y(d_i))}.
$$
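A noiseless toy sketch of this alternation (my own construction; practical methods fit parametric models to raw clicks, e.g., via regression EM): starting from click rates per (document, rank) pair that factorize exactly as relevance times examination, the two closed-form steps above recover both factors up to a scale, which we pin down by fixing the rank-1 propensity to 1.

```python
import numpy as np

# Made-up ground truth, hidden from the algorithm.
true_rel = np.array([0.9, 0.6, 0.3, 0.1])      # P(c = 1 | o = 1, y(d_i))
true_exam = np.array([1.0, 0.5, 0.25, 0.125])  # P(o = 1 | rank)

# Observed click rates per (document, rank) pair; in practice these come from
# logs where each document was displayed at several different ranks.
ctr = true_rel[:, None] * true_exam[None, :]

exam = np.ones(4)  # initial propensity guess
for _ in range(50):
    rel = (ctr / exam[None, :]).mean(axis=1)   # relevance given propensities
    exam = (ctr / rel[:, None]).mean(axis=0)   # propensities given relevance
    exam /= exam[0]                            # fix the scale: p_1 = 1

print("estimated propensities:", np.round(exam, 3))
print("true propensities:     ", np.round(true_exam / true_exam[0], 3))
```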