r, n-gram, tidytext

Split text into ngrams without overlap in R


I have a dataframe where one column contains a lengthy transcript. I want to use unnest_tokens to split the transcript into n-grams of 50 words each. The following code splits the transcript:

library(dplyr)
library(tidytext)

content <- data.frame(channel=c("NBC"), program=c("A"), transcript=c("This is a rather unusual glossary in that all of the words on the list are essentially synonymous - they are nouns meaning nonsense, gibberish, claptrap, hogwash, rubbish ... you get the idea. It probably shouldn't be surprising that this category is so productive of weird words. After all, what better way to disparage someone's ideas than to combine some nonsense syllables to make a descriptor for them? You more or less always can identify their meaning from context alone - either they're used as interjections, preceded by words like 'such' or 'unadulterated' or 'ridiculous'. But which to choose? You have the reduplicated ones (fiddle-faddle), the pseudo-classical (brimborion), the ones that literally mean something repulsive (spinach), and of course the wide variety that are euphemisms for bodily functions. Excluded from this list are the wide variety of very fun terms that are simple vulgarities without any specific reference to nonsense."))

content_ngram <- content %>%
  unnest_tokens(output=sentence, input=transcript, token="ngrams", n=50)

For a transcript 100 words long, the resulting dataframe includes 51 observations, one n-gram per starting position: the first n-gram contains the first 50 words, the second the 2nd through 51st words, and so on. Instead, I would like to split the transcript into non-overlapping n-grams. In the above example, I want a dataframe with two observations, where the first includes an n-gram with words 1-50 and the second includes an n-gram with words 51-100.
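
For reference, a quick sanity check of the sliding-window behavior (a sketch; it assumes tidytext's n-gram tokenizer emits one row per starting position, i.e. N - 50 + 1 rows for an N-word transcript):

# for a 100-word transcript and n = 50, expect 100 - 50 + 1 = 51 rows
nrow(content_ngram)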


Solution

  • One option open to you is to tokenize to single words and then concatenate back up into the chunks you are interested in. This may be a better fit, because n-gram tokenization by definition produces overlapping windows.

    library(tidyverse)
    library(tidytext)
    
    content <- tibble(channel = c("NBC"), 
                      program = c("A"), 
                      transcript = c("This is a rather unusual glossary in that all of the words on the list are essentially synonymous - they are nouns meaning nonsense, gibberish, claptrap, hogwash, rubbish ... you get the idea. It probably shouldn't be surprising that this category is so productive of weird words. After all, what better way to disparage someone's ideas than to combine some nonsense syllables to make a descriptor for them? You more or less always can identify their meaning from context alone - either they're used as interjections, preceded by words like 'such' or 'unadulterated' or 'ridiculous'. But which to choose? You have the reduplicated ones (fiddle-faddle), the pseudo-classical (brimborion), the ones that literally mean something repulsive (spinach), and of course the wide variety that are euphemisms for bodily functions. Excluded from this list are the wide variety of very fun terms that are simple vulgarities without any specific reference to nonsense."))
    
    content %>%
      unnest_tokens(output = sentence, 
                    input = transcript) %>%
      # integer-divide the word position to assign each word to a chunk of
      # roughly 100 words
      group_by(channel, program, observation = row_number() %/% 100) %>%
      summarise(sentence = str_c(sentence, collapse = " ")) %>%
      ungroup()
    
    #> # A tibble: 2 x 4
    #>   channel program observation sentence                                     
    #>   <chr>   <chr>         <dbl> <chr>                                        
    #> 1 NBC     A                 0 this is a rather unusual glossary in that al…
    #> 2 NBC     A                 1 reduplicated ones fiddle faddle the pseudo c…
    

    Created on 2019-12-13 by the reprex package (v0.3.0)
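
    If you want exact 50-word chunks, as the question asks, you can shift the word index before the integer division. Here is a minimal sketch of that variant, reusing the content tibble defined above (the extra per-transcript grouping step is my addition, so the numbering restarts for each transcript if the input ever has more rows):

    library(tidyverse)
    library(tidytext)
    
    content %>%
      unnest_tokens(output = word, input = transcript) %>%
      # number words within each transcript rather than across the whole table
      group_by(channel, program) %>%
      # words 1-50 fall in chunk 0, words 51-100 in chunk 1, and so on
      mutate(observation = (row_number() - 1) %/% 50) %>%
      group_by(channel, program, observation) %>%
      summarise(ngram = str_c(word, collapse = " ")) %>%
      ungroup()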