# Online Evaluation and LTR

- Online evaluation methods have control over what to display to the user.
- At the same time they:
    - Decide what results to display to the user.
    - Infer preferences from user interactions with the chosen results.
- These methods can be much more efficient, because they have (some) control over what data is gathered.
- Comparison with [[Counterfactual Evaluation and LTR]]:
    - Empirically: online methods appear to be more reliable.
    - Theoretically: counterfactual methods are much more advantageous.

![[Online Evaluation.png]]

- An online evaluation algorithm should:
    - Give reliable and correct results:
        - Guarantee correct comparison outcomes, i.e. with a theoretical proof.
        - Be robust to interaction noise.
        - Handle biases in user interactions.
    - Provide a good user experience:
        - Methods should not interfere with the user's task.

## Interleaving

Interleaving answers the question: is ranker A preferred over ranker B? Specific aspects of interactions with rankings can be used for efficient comparisons. The idea:

- Take the two rankings for a query from the two rankers.
- Create an interleaved ranking, based on both rankings.
- Clicks on the interleaved ranking provide preference signals between the rankers.

### Team-Draft Interleaving

- Until $k$ documents are placed:
    - Randomly choose ranker A or ranker B.
    - Let the chosen ranker place its next unplaced document.
    - Let the other ranker place its next unplaced document.
    - Remember which ranker placed which document.
- Present the interleaving to the user and observe clicks.
- The ranker with the most clicks on its placed documents wins.

Team-Draft interleaving finds no preference in expectation under random clicks.

![[Team-Draft Interleaving.png]]

Properties of Team-Draft interleaving:

- User experience:
    - The user is _hardly affected_ by the method.
    - The experience will not be worse than that of the worst ranker.
- Correctness:
    - The ranker that places relevant documents usually wins.
    - Correct outcomes are not guaranteed.

### Probabilistic Interleaving

Introduced by Hofmann et al. (2011), probabilistic interleaving is designed around the fidelity conditions (discussed below). It treats rankers as probability distributions over the set of documents. For a ranker $A$ with the ranking

$$R_{A}(D)=\left[d_{1}, d_{2}, \ldots, d_{N}\right],$$

let $\operatorname{rank}\left(d, R_{A}\right)$ be the rank of $d$ in $R_{A}$. The distribution for ranker $A$ is then modelled by (with $\tau \in \mathbb{R}_{>0}$):

$$P\left(d \mid D, R_{A}\right)=P_{A}(d)=\frac{\frac{1}{\operatorname{rank}\left(d, R_{A}\right)^{\tau}}}{\sum_{d^{\prime} \in D} \frac{1}{\operatorname{rank}\left(d^{\prime}, R_{A}\right)^{\tau}}}$$

After each document is sampled it is removed from $D$ and the distribution is renormalized.

![[Probabilistic Interleaving.png]]

Does this method have fidelity?

- Could a ranker have an advantage due to factors other than relevance?
    - Every ranker is equally likely to place a document at every rank.
    - Thus the expected number of clicks is equal under random interactions.
- Will an unambiguous winner always win in expectation?
    - Every ranker is equally likely to place a document at every rank.
    - The dominating ranker is more likely to place relevant documents at each rank.
    - If relevant documents are more likely to be clicked, then the dominating ranker wins in expectation.

It is thus quite trivial to show that this method has fidelity.
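To make the sampling concrete, here is a minimal Python sketch of the procedure above (an illustration, not a reference implementation; the function names and the default `tau` value are chosen for this example):

```python
import numpy as np

def rank_distribution(ranking, tau=3.0):
    # P(d | ranking) is proportional to 1 / rank(d)^tau, as in the
    # formula above; ranks are 1-based over the remaining documents.
    probs = 1.0 / np.arange(1, len(ranking) + 1) ** tau
    return probs / probs.sum()

def probabilistic_interleave(ranking_a, ranking_b, k, tau=3.0, seed=None):
    """Sample an interleaved list of length k from two rankings.

    Each sampled document is removed from *both* rankings, so the
    next call to rank_distribution renormalizes automatically.
    """
    rng = np.random.default_rng(seed)
    remaining_a, remaining_b = list(ranking_a), list(ranking_b)
    interleaving = []
    while len(interleaving) < k and remaining_a and remaining_b:
        # Pick ranker A or B with equal probability; the choice is
        # deliberately not recorded (no team assignments are kept).
        remaining = remaining_a if rng.random() < 0.5 else remaining_b
        d = str(rng.choice(remaining, p=rank_distribution(remaining, tau)))
        interleaving.append(d)
        remaining_a = [x for x in remaining_a if x != d]
        remaining_b = [x for x in remaining_b if x != d]
    return interleaving

print(probabilistic_interleave(["d1", "d2", "d3", "d4"],
                               ["d3", "d1", "d4", "d2"], k=4, seed=0))
```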
For rankings $R_{A}$ and $R_{B}$, an interleaved list $L$, assignments $T$, and clicks $c$, the _outcome of a comparison_ can be written as

$$O\left(R_{A}, R_{B}, L, T, c\right) \in\{-1,0,1\}.$$

Since clicks are independent of the assignments $T$, we can marginalize over all possible assignments to reach an expected outcome:

$$E\left[O\left(R_{A}, R_{B}, L, c\right)\right]=\sum_{T} P\left(T \mid R_{A}, R_{B}, L\right) O\left(R_{A}, R_{B}, L, T, c\right)$$

This gives us the following method:

- Compute $P_{A}$ and $P_{B}$ from rankers A and B respectively.
- Repeat until $k$ documents are placed:
    - Randomly choose $P_{A}$ or $P_{B}$ and sample a document $d$.
    - Place $d$ _without remembering_ whether A or B was chosen.
    - Renormalize $P_{A}$ and $P_{B}$ after removing $d$.
- Display the interleaving to the user and observe clicks.
- Calculate the _expected outcome_ by marginalizing over all possible assignments.
- The expected winner wins the comparison.

Properties of Probabilistic Interleaving:

- Correctness:
    - Correct outcomes guaranteed w.r.t. fidelity.
    - Marginalization does not affect the expected outcomes.
    - Thus, if the proto-method has fidelity, so does this method.
- User experience:
    - User experience is not guaranteed.
    - Every possible ranking can be displayed, albeit with different probabilities.

### Optimized Interleaving

Introduced by Radlinski and Craswell (2013), optimized interleaving casts interleaving as an optimization problem. Interleavings should only contain top documents from the rankers, i.e. a ranker should always add its highest-ranked unplaced document. The first step is to determine the set of allowed interleavings.

Optimized interleaving can use different scoring functions that meet its requirements. A click on a document $d$ gives a preference score determined by how it is ranked by the two rankers. Common choices are:

**Linear rank difference**:

$$s\left(d, R_{A}, R_{B}\right)=\delta_{d}=\operatorname{rank}\left(d, R_{A}\right)-\operatorname{rank}\left(d, R_{B}\right)$$

**Inverse rank difference**:

$$s\left(d, R_{A}, R_{B}\right)=\delta_{d}=\frac{1}{\operatorname{rank}\left(d, R_{A}\right)}-\frac{1}{\operatorname{rank}\left(d, R_{B}\right)}$$

Assume the user is only position-biased, i.e. only the position determines the click probability. With $p_{L}$ the probability of interleaving $L$ being displayed, unbiasedness requires the expected outcome under such clicks to be zero:

$$E[O]=\sum_{L \in \mathcal{L}}\left(p_{L} \sum_{i=1}^{|L|} P(\operatorname{click}(i))\, s\left(L_{i}, R_{A}, R_{B}\right)\right)=0$$

Finding probabilities $p_{L}$ that meet this requirement is a linear optimization (linear programming) task.

Properties of Optimized Interleaving:

- User experience:
    - Strongest guarantees of all interleaving methods.
- Correctness:
    - The method has fidelity (if optimized for it), as long as the linear optimization is successful.
    - Proven by brute force that there is always a solution for top-10 rankings.
    - Can be correct under other definitions as well.

## Dueling Bandit Gradient Descent

![[Dueling Bandit GD.png]]

Yue and Joachims (2009) prove that under the assumptions:

- There is a single optimal set of parameters $\theta^{*}$.
- The utility space w.r.t. $\theta$ is convex and smooth, i.e. small changes in $\theta$ lead to small changes in user experience.

Dueling Bandit Gradient Descent has sublinear regret:

- The algorithm will eventually approximate the ideal model.
- The time this takes is affected by the number of parameters of the model, the smoothness of the space, the step sizes chosen, etc.
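In outline, DBGD repeatedly perturbs its current model in a random direction, compares the perturbed candidate against the current model through an interleaving experiment, and takes a small step in the winning direction. A minimal sketch of one such step follows; the `interleaved_comparison` callback is a hypothetical stand-in for running a live interleaved comparison on a query:

```python
import numpy as np

def dbgd_step(theta, interleaved_comparison, delta=1.0, gamma=0.1, rng=None):
    """One step of Dueling Bandit Gradient Descent.

    interleaved_comparison(current, candidate) should display an
    interleaving of the two rankers' results for the next query and
    return True iff the candidate wins the click-based comparison.
    """
    rng = rng or np.random.default_rng()
    # Sample a perturbation direction uniformly from the unit sphere.
    u = rng.standard_normal(theta.shape)
    u /= np.linalg.norm(u)
    candidate = theta + delta * u      # exploratory ranker
    if interleaved_comparison(theta, candidate):
        theta = theta + gamma * u      # move toward the winner
    return theta
```

Here `delta` controls how far the candidate explores and `gamma` how large each update is; these step sizes are among the quantities the regret result above depends on.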
Hofmann et al. (2013) introduced the Candidate Pre-Selection method:

- Sample a large number of rankers to create a candidate set.
- Use historical interactions to select the most promising candidate for DBGD.

### Multileave Gradient Descent

- The introduction of multileaving in online evaluation allowed multiple rankers to be compared simultaneously from a single interaction.
- A natural extension of Dueling Bandit Gradient Descent is to combine it with multileaving, resulting in Multileave Gradient Descent (Schuth et al., 2016).
- Multileaving allows comparisons with multiple candidate rankers, increasing the chance of finding an improvement.

Properties of Multileave Gradient Descent:

- Vastly speeds up learning compared to Dueling Bandit Gradient Descent.
- Much better user experience.
- Instead of limiting (guiding) exploration, exploration is done more efficiently.
- Downside: huge computational costs, since a large number of rankers has to be applied to each query.

### Problems with Dueling Bandit Gradient Descent

Theoretical properties:

- Currently, no sound regret bounds have been proven.

Empirical observations:

- The methods do not approach optimal performance.
- Neural models have no advantage over linear models.

Possible solutions:

- Extend the algorithm (the last decade of research) or introduce a new model.
- Find an approach different from the bandit approach.

## Pairwise Differentiable Gradient Descent

Similar to existing pairwise methods (Oosterhuis and de Rijke, 2017; Joachims, 2002), Pairwise Differentiable Gradient Descent infers pairwise document preferences from user clicks. Naively inferring such preferences is biased: some preferences are more likely to be inferred than others due to position/selection bias.

Start with an initial model $\theta_{t}$, then indefinitely:

1. Wait for a user query.
2. Sample (without replacement) a ranking $R$ from the document distribution: $$P\left(d \mid D, \theta_{t}\right)=\frac{e^{f_{\theta_{t}}(\mathbf{d})}}{\sum_{d^{\prime} \in D} e^{f_{\theta_{t}}\left(\mathbf{d}^{\prime}\right)}}$$
3. Display the ranking $R$ to the user.
4. Infer document preferences from the user clicks $\mathbf{c}$.
5. Update the model according to the estimated (unbiased) gradient, where $\rho$ weights each inferred pair to correct for the bias of the inference: $$\nabla f_{\theta_{t}}(\cdot) \approx \sum_{d_{i}>_{\mathbf{c}} d_{j}} \rho\left(d_{i}, d_{j}, R\right) \nabla P\left(d_{i} \succ d_{j} \mid D, \theta_{t}\right)$$

Dueling Bandit Gradient Descent (DBGD):

- Unable to reach optimal performance in ideal settings.
- Strongly affected by noise and position bias.
- Sublinear regret bounds are proven, but they are unsound for ranking problems as commonly applied.
- Single update steps are only as unbiased as the interleaving method used.

Pairwise Differentiable Gradient Descent (PDGD):

- Capable of reaching optimal performance in ideal settings.
- Robust to noise and position bias.
- Considerably outperforms DBGD in all tested experimental settings.
- No regret bounds proven.
- Single update steps are unbiased w.r.t. pairwise document preferences.

For the common ranking problem, neither approach has a theoretical advantage.
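To tie steps 1 through 5 of the PDGD procedure above together, here is a compressed sketch for a linear scorer $f_{\theta}(\mathbf{d}) = \theta \cdot \mathbf{x}_d$. Two simplifying assumptions: the bias-correction weight $\rho$ is fixed to 1 (so this is the biased variant; the actual method computes $\rho$ from the probability of the pair-swapped ranking), and preferences are inferred with the rule that a clicked document is preferred over the unclicked documents above it and the first unclicked document below it. All names here are illustrative:

```python
import numpy as np

def sample_ranking(scores, rng):
    # Step 2: sample without replacement from P(d) proportional to
    # exp(f_theta(d)), renormalizing over the not-yet-placed documents.
    remaining, ranking = list(range(len(scores))), []
    while remaining:
        logits = scores[remaining] - scores[remaining].max()
        probs = np.exp(logits) / np.exp(logits).sum()
        ranking.append(remaining.pop(rng.choice(len(remaining), p=probs)))
    return ranking

def pdgd_update(theta, X, ranking, clicked, lr=0.1):
    # Steps 4 and 5: infer pairwise preferences from clicks and take a
    # gradient step; rho is omitted (set to 1) to keep the sketch short.
    scores = X @ theta
    grad = np.zeros_like(theta)
    for pos, d_i in enumerate(ranking):
        if d_i not in clicked:
            continue
        above = [d for d in ranking[:pos] if d not in clicked]
        below = [d for d in ranking[pos + 1:] if d not in clicked][:1]
        for d_j in above + below:               # inferred: d_i > d_j
            p_ij = 1.0 / (1.0 + np.exp(scores[d_j] - scores[d_i]))
            # For linear f: grad P(d_i > d_j) = p_ij (1 - p_ij)(x_i - x_j).
            grad += p_ij * (1.0 - p_ij) * (X[d_i] - X[d_j])
    return theta + lr * grad

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))    # hypothetical document feature vectors
theta = np.zeros(3)
ranking = sample_ranking(X @ theta, rng)        # steps 1-3 for one query
theta = pdgd_update(theta, X, ranking, clicked={ranking[0]})
```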