N-grams and Word Sequences

As the previous section showed, counting words is useful for exploring common terms in a text corpus, but it does not capture the context in which words are used. To gain deeper insight into the relationships between words, we can analyze sequences of words, known as n-grams. N-grams are contiguous sequences of ‘n’ items (words) from a given text. For example, a bigram is a sequence of two words, while a trigram is a sequence of three words.

Creating N-grams

Because creating n-grams involves tokenizing text into sequences of words, we can use the unnest_tokens() function from the tidytext package again, this time setting the token argument to "ngrams" and choosing n, the number of words in each sequence.
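
The examples below assume the packages and preprocessed comments from the earlier sections are already loaded; the setup for this page looks like the following (adjust the file path to wherever you saved the data):

# Packages used throughout this page
library(tidyverse)
library(tidytext)

# Preprocessed comments produced in the Text Preprocessing sections
comments <- readr::read_csv("../../data/clean/comments_preprocessed.csv")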

# Creating bigrams (2-grams) from the comments
ngrams <- comments %>%
  unnest_tokens(ngrams, comments, token = "ngrams", n = 2) #bigrams 

ngrams
# A tibble: 67,100 × 3
    ...1 id      ngrams          
   <dbl> <chr>   <chr>           
 1     1 s1_0001 everyone telling
 2     1 s1_0001 telling watch   
 3     1 s1_0001 watch severance 
 4     1 s1_0001 severance nobody
 5     1 s1_0001 nobody tryna    
 6     1 s1_0001 tryna watch     
 7     1 s1_0001 watch apple     
 8     2 s1_0002 can quite       
 9     2 s1_0002 quite explain   
10     2 s1_0002 explain show    
# ℹ 67,090 more rows

The resulting ngrams data frame contains bigrams extracted from the comments. Each row represents a bigram, which consists of two consecutive words from the original text.
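
For example, to inspect the bigrams produced from a single comment, you can filter by its id (ids as shown in the output above):

# Bigrams extracted from one specific comment
ngrams %>%
  filter(id == "s1_0001")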

By changing the value of n in the unnest_tokens() function, we can create trigrams (3-grams), four-grams, and so on, depending on our analysis needs.

# Creating trigrams (3-grams) from the comments
trigrams <- comments %>%
  unnest_tokens(ngrams, comments, token = "ngrams", n = 3) #trigrams
trigrams
# A tibble: 61,297 × 3
    ...1 id      ngrams                    
   <dbl> <chr>   <chr>                     
 1     1 s1_0001 everyone telling watch    
 2     1 s1_0001 telling watch severance   
 3     1 s1_0001 watch severance nobody    
 4     1 s1_0001 severance nobody tryna    
 5     1 s1_0001 nobody tryna watch        
 6     1 s1_0001 tryna watch apple         
 7     2 s1_0002 can quite explain         
 8     2 s1_0002 quite explain show        
 9     2 s1_0002 explain show severance    
10     2 s1_0002 show severance captivating
# ℹ 61,287 more rows
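
Depending on how much cleaning was done earlier, the n-grams may still contain common function words (for example, “can quite” above). One optional follow-up, sketched here and not part of the workflow below, is to split the bigrams and drop any that contain a stop word, using the stop_words lexicon bundled with tidytext:

library(tidyr)

# Optional, illustrative: remove bigrams in which either word is a stop word
bigrams_filtered <- ngrams %>%
  separate(ngrams, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(ngrams, word1, word2, sep = " ")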

Next Word Prediction Using N-grams

One practical application of n-grams is in building simple predictive text models. For instance, we can create a function that predicts the next word based on a given word using bigrams.

# Function to predict the next word based on a given word using bigrams
next_word <- function(word, ngrams_df) {
    # Collect the second word of every bigram whose first word matches the input
    matches <- ngrams_df %>%
        separate(ngrams, into = c("w1", "w2"), sep = " ", remove = FALSE) %>%
        filter(w1 == word) %>%
        pull(w2)
    # Count how often each candidate follows the input word
    freq <- table(matches)
    nw <- max(freq)
    # Return the most frequent next word(s); ties return more than one candidate
    return(names(freq[freq == nw]))
}

This function takes a word and the n-grams data frame as inputs, finds all bigrams whose first word matches the input word, and returns the most frequently occurring second word as the predicted next word (if several words tie for the top frequency, all of them are returned).

We can see how this function works by providing an example:

type_any_word = "ben"

next_word(type_any_word, ngrams)
[1] "stiller"

We can even play with a simple loop to see how the prediction evolves:

current_word = "wow"
for (i in 1:5) {
  predicted_word = next_word(current_word, ngrams)
  cat(current_word, "->", predicted_word, "\n")
  current_word = predicted_word
}
wow -> severance 
severance -> season 
season -> finale 
finale -> severance 
severance -> season 

If you have played with this code, you might notice that the predictions can lead to repetitive or nonsensical sequences. This is a limitation of simple n-gram models used without additional context or smoothing techniques. We can try trigrams, which condition on the two preceding words, to see whether the predictions improve:

# Function to predict the next word based on a given two-word phrase using trigrams
next_word_trigram <- function(phrase, trigrams_df) {
    # Split the input phrase into its two words
    words <- unlist(strsplit(phrase, " "))
    if (length(words) != 2) {
        stop("Please provide a two-word phrase.")
    }
    # Collect the third word of every trigram whose first two words match the phrase
    matches <- trigrams_df %>%
        separate(ngrams, into = c("w1", "w2", "w3"), sep = " ", remove = FALSE) %>%
        filter(w1 == words[1], w2 == words[2]) %>%
        pull(w3)
    # Return the most frequent completion(s); ties return more than one candidate
    freq <- table(matches)
    nw <- max(freq)
    return(names(freq[freq == nw]))
}

To use this function, you provide a two-word phrase, for instance “best show”:

type_any_phrase = "best show"
next_word_trigram(type_any_phrase, trigrams)
[1] "ever" "tv"  

From N-grams to Collocations

While n-grams capture all consecutive word sequences, not all of them are equally meaningful. Collocations are word combinations that occur together more frequently than would be expected by chance. They represent meaningful multi-word expressions like “strong coffee,” “make a decision,” or in our data, perhaps “plot twist” or “character development.”

The key difference:

  • N-grams: mechanical extraction of all consecutive words
  • Collocations: statistically significant word pairs that carry specific meaning

Identifying Collocations

To find collocations, we need to measure how “associated” two words are. One common metric is Pointwise Mutual Information (PMI), which compares how often words appear together versus how often we’d expect them to appear together if they were independent.

Note: Other Collocation Metrics

While we use PMI in this workshop, there are several other statistical measures commonly used to identify collocations:

  • Chi-square (χ²): Tests the independence of two words by comparing observed vs. expected frequencies. Higher values indicate stronger association.

  • Log-likelihood ratio (G²): Similar to chi-square but more reliable for small sample sizes. Commonly used in corpus linguistics.

  • T-score: Measures the confidence in the association between two words. Less sensitive to low-frequency pairs than PMI.

  • Dice coefficient: Measures the overlap between two words’ contexts. Values range from 0 to 1.

Each metric has different strengths. PMI favors rare but strongly associated pairs, while t-score is more conservative and favors frequent collocations. The choice depends on your research goals and corpus characteristics.

First, let’s separate our bigrams and count them:

library(tidyr)

# Separate bigrams into individual words and count
bigram_counts <- ngrams %>%
  separate(ngrams, into = c("word1", "word2"), sep = " ", remove = FALSE) %>%
  count(word1, word2, sort = TRUE)

head(bigram_counts, 10)
# A tibble: 10 × 3
   word1     word2         n
   <chr>     <chr>     <int>
 1 season    finale     1793
 2 severance season     1500
 3 apple     tv          689
 4 finale    severance   532
 5 season    severance   378
 6 severance finale      376
 7 can       wait        181
 8 best      show        159
 9 second    season      159
10 severance apple       156

Now we’ll calculate PMI for each bigram. PMI is defined as:

\[\text{PMI}(w_1, w_2) = \log_2\left(\frac{P(w_1, w_2)}{P(w_1) \times P(w_2)}\right)\]

Where:

  • \(P(w_1, w_2)\) is the probability of the bigram occurring
  • \(P(w_1)\) and \(P(w_2)\) are the probabilities of each word occurring independently

library(dplyr)

# Calculate individual word frequencies
word_freqs <- comments %>%
  unnest_tokens(word, comments) %>%
  count(word, name = "word_count")

# Total number of words in corpus
total_words <- sum(word_freqs$word_count)

# Total number of bigrams
total_bigrams <- sum(bigram_counts$n)

# Calculate PMI
collocations <- bigram_counts %>%
  left_join(word_freqs, by = c("word1" = "word")) %>%
  rename(word1_count = word_count) %>%
  left_join(word_freqs, by = c("word2" = "word")) %>%
  rename(word2_count = word_count) %>%
  mutate(
    # Probability of bigram
    p_bigram = n / total_bigrams,
    # Probability of each word
    p_word1 = word1_count / total_words,
    p_word2 = word2_count / total_words,
    # PMI calculation
    pmi = log2(p_bigram / (p_word1 * p_word2))
  ) %>%
  arrange(desc(pmi))

head(collocations, 15)
# A tibble: 15 × 9
   word1      word2     n word1_count word2_count p_bigram p_word1 p_word2   pmi
   <chr>      <chr> <int>       <int>       <int>    <dbl>   <dbl>   <dbl> <dbl>
 1 abbott     elem…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
 2 abrams     ente…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
 3 accompani… guid…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
 4 ace        disb…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
 5 acknowled… empl…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
 6 activates  glas…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
 7 adams      stal…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
 8 affection  rain…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
 9 afternoon  wear…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
10 al         sc        1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
11 alfred     neum…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
12 alia       shaw…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
13 ambivalent rela…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
14 ampex      resu…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
15 amy        schu…     1           1           1  1.49e-5 1.37e-5 1.37e-5  16.3
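
To make the formula concrete, take any of the one-off pairs at the top of this table: the bigram occurs once and each of its words occurs once, so

\[\text{PMI} = \log_2\left(\frac{1.49 \times 10^{-5}}{(1.37 \times 10^{-5}) \times (1.37 \times 10^{-5})}\right) \approx 16.3\]

which matches the value in the pmi column.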

High PMI values indicate strong collocations, that is, word pairs that appear together far more often than chance would predict. Notice, however, that the top of the table is dominated by bigrams that occur only once: PMI is known to inflate the scores of rare pairs, which is why the prediction function later in this section filters on a minimum frequency.
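
As mentioned in the note on other collocation metrics, the Dice coefficient is one alternative that is easy to compute from the counts we already have. A minimal sketch (illustrative, not part of the workshop code):

# Illustrative: Dice coefficient = 2 * f(w1, w2) / (f(w1) + f(w2))
dice_scores <- collocations %>%
  mutate(dice = 2 * n / (word1_count + word2_count)) %>%
  arrange(desc(dice)) %>%
  select(word1, word2, n, dice)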

Visualizing Collocations

Let’s visualize the strongest collocations to see what meaningful phrases emerge from our Severance comments:

library(ggplot2)

# Top 20 collocations by PMI
top_collocations <- collocations %>%
  head(20) %>%
  unite(bigram, word1, word2, sep = " ")

ggplot(top_collocations, aes(x = reorder(bigram, pmi), y = pmi)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Collocations by PMI",
    x = "Bigram",
    y = "Pointwise Mutual Information"
  ) +
  theme_minimal()
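
Because the highest-PMI pairs in this corpus each occur only once, the chart above is dominated by one-off combinations. A common refinement, sketched here with an arbitrary threshold and not part of the workshop code, is to require a minimum bigram frequency before ranking by PMI:

# Illustrative: keep only bigrams seen at least 5 times before ranking by PMI
top_collocations_min5 <- collocations %>%
  filter(n >= 5) %>%
  slice_max(pmi, n = 20) %>%
  unite(bigram, word1, word2, sep = " ")

ggplot(top_collocations_min5, aes(x = reorder(bigram, pmi), y = pmi)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Collocations by PMI (bigrams occurring at least 5 times)",
    x = "Bigram",
    y = "Pointwise Mutual Information"
  ) +
  theme_minimal()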

Using Collocations for Smarter Prediction

Remember our simple n-gram predictor that sometimes got stuck in loops? We can create a more “intelligent” predictor using collocations instead of raw frequency counts. The idea is simple: instead of picking the most frequent next word, we pick the word with the highest PMI (strongest association).

# Function to predict next word using collocation strength (PMI)
next_word_collocation <- function(word, collocations_df, min_freq = 2) {
    candidates <- collocations_df %>%
        filter(word1 == word, n >= min_freq, pmi > 0) %>%
        arrange(desc(pmi))
    
    # Return the word with highest PMI, or NA if no matches
    if (nrow(candidates) > 0) {
        return(candidates$word2[1])
    } else {
        return(NA)
    }
}

Let’s compare the two approaches side by side:

# Compare frequency-based vs. collocation-based prediction
test_word <- "mark"

freq_prediction <- next_word(test_word, ngrams)
colloc_prediction <- next_word_collocation(test_word, collocations)

cat("Frequency-based predictor:", test_word, "->", freq_prediction, "\n")
Frequency-based predictor: mark -> helly 
cat("Collocation-based predictor:", test_word, "->", colloc_prediction, "\n")
Collocation-based predictor: mark -> completing 

Now let’s run both predictors in a loop and see which produces more meaningful sequences:

# Frequency-based prediction
current_word <- "wow"
for (i in 1:10) {
  predicted_word <- next_word(current_word, ngrams)
  cat(current_word, "->", predicted_word, "\n")
  current_word <- predicted_word
}
wow -> severance 
severance -> season 
season -> finale 
finale -> severance 
severance -> season 
season -> finale 
finale -> severance 
severance -> season 
season -> finale 
finale -> severance 

# Collocation-based prediction
current_word <- "wow"
for (i in 1:10) {
  predicted_word <- next_word_collocation(current_word, collocations)
  if (is.na(predicted_word)) {
    cat(current_word, "-> (no strong collocation found)\n")
    break
  }
  cat(current_word, "->", predicted_word, "\n")
  current_word <- predicted_word
}
wow -> intense 
intense -> exploding 
exploding -> head 
head -> contenders 
contenders -> tv 
tv -> announces 
announces -> severance 
severance -> tvtime 
tvtime -> (no strong collocation found)

As you can see, both approaches share the same structure: each looks up the next word based on the current word. However, the collocation-based predictor leverages statistical associations between words rather than raw frequency counts, potentially leading to more contextually relevant predictions. This is an example of how different text analysis techniques can produce different results from the same data depending on the method used.
