So far, we have explored word frequencies and n-grams to understand common terms and phrases in our text data. However, simply counting words has a limitation: some words are frequent because they appear often across all documents, not because they are particularly meaningful for a specific document or group.
For example, in our Severance dataset, words like “season,” “episode,” and “show” might appear frequently in comments about both Season 1 and Season 2. While these words are common, they don’t help us understand what makes each season’s discussion distinctive.
This is where TF-IDF (Term Frequency-Inverse Document Frequency) becomes useful. TF-IDF is a statistical measure that evaluates how important a word is to a document in a collection of documents (corpus). It helps us identify words that are frequent in one document but rare across the entire corpus—precisely the words that make a document unique.
## Understanding TF-IDF
TF-IDF combines two metrics:
1. **Term Frequency (TF)**: How often a word appears in a document
2. **Inverse Document Frequency (IDF)**: How rare a word is across all documents
The formula is:
\[\text{TF-IDF} = \text{TF} \times \text{IDF}\]
Where:
\[\text{TF}(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total terms in document } d}\]
\[\text{IDF}(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)\]
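As a quick worked example with made-up numbers: suppose a term appears 5 times in a 100-word document and occurs in 2 of the 10 documents in the corpus. Then:

\[\text{TF} = \frac{5}{100} = 0.05, \qquad \text{IDF} = \log\left(\frac{10}{2}\right) \approx 1.61, \qquad \text{TF-IDF} = 0.05 \times 1.61 \approx 0.08\]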
A word gets a **high TF-IDF score** when:

- It appears frequently in a particular document (high TF)
- It appears in few other documents (high IDF)

A word gets a **low TF-IDF score** when:

- It appears in many documents (low IDF), even if it's frequent in one document
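Before reaching for a helper function, it can be instructive to compute these quantities by hand. Below is a minimal sketch on an invented two-document corpus (the documents, words, and counts are all made up for illustration); it uses only `dplyr` verbs and mirrors the formulas above:

```r
library(dplyr)

# Toy term counts: two documents, three terms (invented for illustration)
counts <- tibble(
  doc  = c("a", "a", "b", "b"),
  word = c("innie", "waffle", "innie", "goat"),
  n    = c(4, 1, 2, 3)
)

counts %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                 # TF: share of the document's tokens
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(2 / n_distinct(doc))) %>%  # IDF: corpus has 2 documents
  ungroup() %>%
  mutate(tf_idf = tf * idf)
```

Note what happens to "innie", which appears in both documents: its IDF is log(2/2) = 0, so its TF-IDF is 0 no matter how frequent it is. This behavior is exactly what we will rely on in the season comparison below.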
## Calculating TF-IDF
In our case, we want to compare the vocabulary between Season 1 and Season 2 comments. We'll treat each season as a "document" and calculate TF-IDF to find which words are distinctive to each season. Note that with only two documents, IDF can take just two values: log(2) for words that appear in one season only, and 0 for words that appear in both. The top TF-IDF words for a season are therefore its season-exclusive terms, ranked by how frequently they occur within that season.

First, we need to extract season information from the `id` column and tokenize the comments:
```r
# Calculate TF-IDF by season
comments_tfidf <- comments %>%
  mutate(season = str_extract(id, "s[12]")) %>%  # Extract season (s1 or s2)
  unnest_tokens(word, comments) %>%              # Tokenize into words
  count(season, word, sort = TRUE)               # Count words per season

head(comments_tfidf)
```
```
# A tibble: 6 × 3
  season word          n
  <chr>  <chr>     <int>
1 s2     severance  4200
2 s2     season     3219
3 s1     severance  1840
4 s2     finale     1833
5 s1     season     1261
6 s2     show       1020
```
Now we can apply the `bind_tf_idf()` function from the `tidytext` package, which automatically calculates TF, IDF, and TF-IDF for us:
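```r
# Apply TF-IDF calculation
comments_tfidf <- comments_tfidf %>%
  bind_tf_idf(word, season, n)

head(comments_tfidf, 15)
```

The resulting data frame includes:

- `tf`: Term frequency (the proportion of times the word appears in that season)
- `idf`: Inverse document frequency (how rare the word is across seasons)
- `tf_idf`: The product of TF and IDF

Let's examine the top words by TF-IDF for each season:

```r
# Top 10 distinctive words per season
comments_tfidf %>%
  group_by(season) %>%
  slice_max(tf_idf, n = 10)
```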
Notice how these words are much more specific and meaningful than simply looking at the most frequent words. These are the words that truly characterize each season’s discussion.
## Visualizing Distinctive Vocabulary
To better understand the distinctive vocabulary of each season, we can create a visualization comparing the top TF-IDF words:
```r
# Prepare data for visualization
top_tfidf_words <- comments_tfidf %>%
  group_by(season) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, season))

# Plot distinctive vocabulary by season
ggplot(top_tfidf_words, aes(tf_idf, word, fill = season)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~season, scales = "free") +
  scale_y_reordered() +
  labs(
    x = "TF-IDF",
    y = NULL,
    title = "Distinctive Vocabulary by Season"
  ) +
  theme_minimal()
```
This visualization clearly shows which words are most characteristic of each season’s discussions. Words with higher TF-IDF scores are those that appear frequently in one season but not in the other, making them useful markers of distinctive content.
## Comparing TF-IDF to Raw Frequency
To appreciate the value of TF-IDF, let’s compare it to simple word counts. We’ll look at the top words by frequency versus the top words by TF-IDF for Season 1:
```r
# Top words by raw frequency for Season 1
top_freq_s1 <- comments %>%
  filter(grepl("^s1", id)) %>%
  unnest_tokens(word, comments) %>%
  count(word, sort = TRUE) %>%
  head(15)

# Top words by TF-IDF for Season 1
top_tfidf_s1 <- comments_tfidf %>%
  filter(season == "s1") %>%
  arrange(desc(tf_idf)) %>%
  head(15)

# Top 15 words by frequency (Season 1)
print(top_freq_s1)
```
```
# A tibble: 15 × 2
   word          n
   <chr>     <int>
 1 severance  1840
 2 season     1261
 3 finale      787
 4 show        632
 5 tv          402
 6 apple       304
 7 best        276
 8 can         220
 9 just        208
10 wait        203
11 one         193
12 watch       191
13 now         153
14 good        148
15 seen        142
```
```r
# Top 15 words by TF-IDF (Season 1)
print(top_tfidf_s1 %>% select(word, n, tf_idf))
```
The raw frequency list likely includes many words that are common across both seasons, while the TF-IDF list highlights words that are specifically important to Season 1 discussions.
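We can quantify that overlap directly. Here is a minimal sketch using the two objects defined above:

```r
# Words that rank highly under both measures
intersect(top_freq_s1$word, top_tfidf_s1$word)

# Words that are frequent but not distinctive (likely shared across seasons)
setdiff(top_freq_s1$word, top_tfidf_s1$word)
```

A small intersection suggests that raw counts and TF-IDF surface quite different vocabularies.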
## When to Use TF-IDF
TF-IDF is particularly useful for:
1. **Document comparison**: Identifying what makes each document unique in a collection
2. **Feature extraction**: Preparing text data for machine learning by emphasizing distinctive words
3. **Topic discovery**: Finding characteristic vocabulary for different groups or categories
4. **Search and retrieval**: Ranking documents by relevance to a query (search engines use variations of TF-IDF; a minimal ranking sketch follows this list)
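To make the last point concrete, here is a minimal sketch of TF-IDF-based ranking. The toy corpus and the query terms are invented for illustration, and real search engines use refinements of this idea (such as BM25) rather than this raw score:

```r
library(dplyr)
library(tidytext)

# Toy corpus: three tiny "documents" (invented for illustration)
docs <- tibble(
  doc  = c("d1", "d2", "d3"),
  text = c("the severed floor is cold",
           "the board approves the severed floor",
           "kier loves music and dance")
)

query <- c("severed", "floor")

# Rank documents by the summed TF-IDF of the query terms
docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%
  bind_tf_idf(word, doc, n) %>%
  filter(word %in% query) %>%
  group_by(doc) %>%
  summarise(score = sum(tf_idf)) %>%
  arrange(desc(score))  # Highest score = most relevant to the query
```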
**Tip: Limitations of TF-IDF**

While TF-IDF is powerful, it has some limitations:

- **No semantic understanding**: It treats words as independent units and doesn't recognize synonyms or context
- **Corpus dependency**: TF-IDF scores depend on the entire corpus, so adding or removing documents changes the scores
- **Document length bias**: Large differences in document length can still skew the scores, even though dividing TF by the document's total terms partially normalizes for length
For more advanced semantic analysis, techniques like word embeddings or transformer models might be more appropriate.
TF-IDF bridges the gap between simple word counting and more sophisticated text analysis techniques. By weighting words according to both their local importance (within a document) and their global rarity (across the corpus), it helps us discover the vocabulary that truly distinguishes different parts of our text data.