# Collaborative filtering
- Collaborative (Social) filtering
- Leverages the ratings of user $u$ together with the ratings of other users in the system
- Key idea: if users $u$ and $v$ rated items similarly in the past, and $v$ has rated an item that $u$ has not, then $u$'s rating for that item is likely to be close to $v$'s
- Content of items no longer needed!
- Content may be a bad indicator depending on the domain/circumstances
Methodology
- Neighbourhood-based
- Use stored ratings directly
- Nearest neighbours
- Model-based
	- Learn a predictive model of user-item interactions (e.g. via latent factors)
	- Predict new/missing ratings using the trained model
What is being recommended
- User-based
- Use other users to infer ratings of an item for a user
- 'Neighbours' - users who have similar ratings
- Item-based
- Use ratings of similar items (that user has rated) to predict the rating for a given item
Types of ratings:
- Explicit Ratings
	- Like/Dislike; IMDb ratings
- Might not be available
- Implicit
- Time spent on web-page; Clicks
- More available, but intention is unclear (suffers from biases and ambiguity)
## Neighbourhood-based recommenders
Assume explicit ratings and a rating matrix that is not too sparse
$
\begin{array}{c||c|c|c|c|c}
\hline & \begin{array}{c}
\text { The } \\
\text { Matrix }
\end{array} & \text { Titanic } & \begin{array}{c}
\text { Die } \\
\text { Hard }
\end{array} & \begin{array}{c}
\text { Forrest } \\
\text { Gump }
\end{array} & \text { Wall-E } \\
\hline \hline \text { John } & 5 & 1 & & 2 & 2 \\
\text { Lucy } & 1 & 5 & 2 & 5 & 5 \\
\text { Eric } & 2 & ? & 3 & 5 & 4 \\
\text { Diane } & 4 & 3 & 5 & 3 & \\
\hline
\end{array}
$
- Eric and Lucy have similar tastes
- Eric and Diane have different tastes
- What should the rating be for Eric for Titanic?
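The claims about taste similarity can be checked numerically. Below is a minimal sketch using numpy and a plain cosine similarity over co-rated items; note the lecture's similarity values (0.75 and 0.15, used later) presumably come from a different measure (e.g. mean-centered/Pearson correlation), so only the *ordering* here should be compared.

```python
import numpy as np

# Rating matrix from the table above; np.nan marks missing ratings.
# Columns: The Matrix, Titanic, Die Hard, Forrest Gump, Wall-E
R = np.array([
    [5, 1, np.nan, 2, 2],   # John
    [1, 5, 2, 5, 5],        # Lucy
    [2, np.nan, 3, 5, 4],   # Eric
    [4, 3, 5, 3, np.nan],   # Diane
])

def cosine_sim(u, v):
    """Cosine similarity computed over the items both users rated."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    a, b = u[mask], v[mask]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

lucy, eric, diane = R[1], R[2], R[3]
print(cosine_sim(eric, lucy))   # higher: similar tastes
print(cosine_sim(eric, diane))  # lower: different tastes
```

Raw cosine keeps both similarities high because all ratings are positive; mean-centering each user first (as discussed below) separates the two pairs much more sharply.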
User representation: rating vector (one dimension per item)
Consider the $k$-nearest neighbours of user $u$ who have rated item $i$: $\mathcal{N}_{i}(u)$
$
r_{u i}=\frac{1}{\left|\mathcal{N}_{i}(u)\right|} \sum_{v \in \mathcal{N}_{i}(u)} r_{v i}
$
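The unweighted average above, applied to the running example (assuming Eric's neighbours who rated Titanic are Lucy and Diane):

```python
import numpy as np

# Unweighted prediction: the mean of the neighbours' ratings for item i.
# Lucy rated Titanic 5, Diane rated it 3 (from the table above).
neighbour_ratings = np.array([5, 3])
r_ui = neighbour_ratings.mean()
print(r_ui)  # 4.0
```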
- But Lucy is more similar to Eric than Diane is
Weight each neighbour's rating by their similarity $w_{uv}$ to $u$:
$
r_{u i}=\frac{\sum_{v \in \mathcal{N}_{i}(u)} w_{u v} r_{v i}}{\sum_{v \in \mathcal{N}_{i}(u)}\left|w_{u v}\right|}
$
Given similarity $w(\text{Eric}, \text{Lucy}) = 0.75$ and $w(\text{Eric}, \text{Diane}) = 0.15$:
$
r=\frac{0.75 \times 5+0.15 \times 3}{0.75+0.15} \approx 4.67
$
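The weighted prediction can be sketched directly from the formula, using the similarity values given in the text:

```python
# Similarity-weighted prediction for Eric's rating of Titanic.
# w(Eric, Lucy) = 0.75, w(Eric, Diane) = 0.15 (given above).
weights = [0.75, 0.15]
ratings = [5, 3]  # Lucy's and Diane's ratings for Titanic

r_ui = sum(w * r for w, r in zip(weights, ratings)) / sum(abs(w) for w in weights)
print(round(r_ui, 2))  # 4.67
```

Because Lucy's weight dominates, the prediction (≈4.67) sits much closer to her rating of 5 than the unweighted mean of 4.0 would.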
- Users may use rating values differently: a '5' from John (who always rates $1/2/5$) might not mean the same as a '5' from Diane (who rates $3/4/5$)
- Solution: normalise the ratings per user, e.g. mean-centering or z-score normalisation
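Both normalisations are one-liners in numpy; here is a sketch on John's row from the table (the NaN-aware reductions skip unrated items):

```python
import numpy as np

# John's row: he only uses the extremes of the scale.
ratings = np.array([5, 1, np.nan, 2, 2])

# Mean-centering: express each rating as a deviation from the user's mean.
mean = np.nanmean(ratings)        # 2.5
centered = ratings - mean
print(centered)                   # [ 2.5 -1.5  nan -0.5 -0.5]

# Z-score: additionally divide by the user's rating standard deviation,
# which also corrects for how *spread out* each user's ratings are.
z = centered / np.nanstd(ratings)
print(z)
```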
Other problems:
- Limited coverage: users can be neighbours only if they have rated items in common
- Sparsity makes this worse as the number of items increases!
### Matrix Factorization
- Attempts to solve the sparsity/coverage problems by projecting user/item vectors into a dense latent space, by either:
- Decompose rating matrix
- Decompose similarity matrix
Given $R$, a $|\mathcal{U}| \times |\mathcal{F}|$ matrix of rank $n$
- Approximate using [[Singular Value Decomposition (SVD)]] by $\hat{R}=P Q^{T}$ of rank $k<n$
- $P$ is a $|\mathcal{U}| \times k$ matrix of user latent factors
- $Q$ is a $|\mathcal{F}| \times k$ matrix of item latent factors
$
\operatorname{err}(P, Q)=|| R-P Q^{T}||_{F}^{2}=\sum_{u, i}\left(r_{u i}-p_{u} q_{i}^{T}\right)^{2}
$
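One way to minimise the error above on a sparse matrix is plain SGD over the observed entries only (a sketch, not the full method: regularisation and bias terms, which practical systems add, are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed (user, item, rating) triples from the table; missing cells omitted.
observed = [(0, 0, 5), (0, 1, 1), (0, 3, 2), (0, 4, 2),
            (1, 0, 1), (1, 1, 5), (1, 2, 2), (1, 3, 5), (1, 4, 5),
            (2, 0, 2), (2, 2, 3), (2, 3, 5), (2, 4, 4),
            (3, 0, 4), (3, 1, 3), (3, 2, 5), (3, 3, 3)]

n_users, n_items, k = 4, 5, 2
P = rng.normal(scale=0.1, size=(n_users, k))   # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # item latent factors

lr = 0.01
for epoch in range(2000):
    for u, i, r in observed:
        err = r - P[u] @ Q[i]                  # (r_ui - p_u q_i^T)
        # Simultaneous gradient step on both factor vectors.
        P[u], Q[i] = P[u] + lr * err * Q[i], Q[i] + lr * err * P[u]

# Training error over observed entries should now be small.
mse = np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in observed])

# Predict Eric's missing rating for Titanic (user 2, item 1):
print(P[2] @ Q[1])
```

The key point: unlike the neighbourhood method, $\hat{r}_{ui} = p_u q_i^T$ is defined for *every* user-item pair, including cells with no co-rating overlap, which is how factorisation addresses the coverage problem.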
### Advantages and Disadvantages
**Pros**:
- No feature engineering required—works directly with user-item interactions
- Content-independent—doesn't need item descriptions or metadata
- Discovers unexpected connections—can recommend items that seem unrelated but appeal to similar users
- Addresses filter bubbles by introducing diversity beyond content similarity
**Cons**:
- Sparsity problem—struggles when user-item interaction data is sparse
- Cold start problem—cannot recommend to new users or new items without interaction history
- Popularity bias—tends to favor popular items over niche ones
- Limited help for users with unique preferences—performs poorly for outlier taste profiles