# TF-IDF
$
\text{TF-IDF}(t, d, D)=\operatorname{tf}(t, d) \cdot \operatorname{idf}(t, D)
$
## Term frequency
Term frequency, $\operatorname{tf}(t, d)$, is the relative frequency of term $t$ within document $d$,
$
\operatorname{tf}(t, d)=\frac{f_{t, d}}{\sum_{t^{\prime} \in d} f_{t^{\prime}, d}},
$
where $f_{t, d}$ is the raw count of a term in a document, i.e., the number of times that term $t$ occurs in document $d$.
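As a quick illustration, the formula maps directly onto a few lines of Python (the `tf` helper and the toy document below are illustrative sketches, not from any particular library):

```python
from collections import Counter

def tf(term: str, document: list[str]) -> float:
    """Relative frequency of `term` in `document`: f_{t,d} over the total token count."""
    return Counter(document)[term] / len(document)

doc = ["the", "cat", "sat", "on", "the", "mat"]
print(tf("the", doc))  # 2 occurrences out of 6 tokens ≈ 0.333
```

Note that the denominator $\sum_{t^{\prime} \in d} f_{t^{\prime}, d}$ is simply the total number of tokens in the document, which is why `len(document)` suffices.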
## Inverse document frequency
The inverse document frequency is a measure of how much information the word provides, i.e., if it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):
$
\operatorname{idf}(t, D)=\log \frac{N}{|\{d \in D: t \in d\}|}
$
with
- $N$: total number of documents in the corpus, $N = |D|$
- $|\{d \in D: t \in d\}|$: number of documents in which the term $t$ appears (i.e., $\operatorname{tf}(t, d) \neq 0$). If the term does not appear in the corpus, this leads to a division by zero. It is therefore common to adjust the denominator to $1+|\{d \in D: t \in d\}|$, as in the sketch below.
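A minimal Python sketch of idf with the smoothed denominator described above (the `idf` helper and the toy corpus are illustrative assumptions, not a library API):

```python
import math

def idf(term: str, corpus: list[list[str]]) -> float:
    """log(N / (1 + n_t)), where n_t counts the documents containing `term`."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + n_containing))

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["a", "bird", "flew"],
]
print(idf("the", corpus))   # in 2 of 3 documents: log(3 / 3) = 0.0
print(idf("cat", corpus))   # in 1 of 3 documents: log(3 / 2) ≈ 0.405
print(idf("fish", corpus))  # absent: log(3 / 1) ≈ 1.099, no division by zero
```

The last line shows the point of the adjusted denominator: a term that appears nowhere in the corpus still yields a finite value instead of a division-by-zero error.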
## TF-IDF
Then tf-idf is calculated as
$
\operatorname{tfidf}(t, d, D)=\operatorname{tf}(t, d) \cdot \operatorname{idf}(t, D)
$
A high weight in tf-idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.
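Putting the two pieces together, this self-contained sketch (same illustrative helpers and toy corpus as above) shows how a term common to the whole corpus collapses to a weight of 0 while a rarer term keeps a positive weight:

```python
import math
from collections import Counter

def tf(term: str, document: list[str]) -> float:
    return Counter(document)[term] / len(document)

def idf(term: str, corpus: list[list[str]]) -> float:
    # smoothed denominator, as discussed above
    return math.log(len(corpus) / (1 + sum(1 for d in corpus if term in d)))

def tfidf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    return tf(term, document) * idf(term, corpus)

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "bird", "flew"]]
doc = corpus[0]
print(tfidf("the", doc, corpus))  # common term: tf = 1/3, idf = 0 -> 0.0
print(tfidf("cat", doc, corpus))  # rarer term: tf = 1/3, idf ≈ 0.405 -> ≈ 0.135
```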