Tags: r, nlp, text-mining, n-gram, quanteda

Table of n-grams and identifying the row in which the text appeared


I want to construct a table with the n-grams in one column and, in a second column, the row number of the data frame row each n-gram came from.

For example, the code below constructs the n-grams (quadgrams in this case):

# Libraries
library(quanteda)
library(data.table)
library(tidyverse)
library(stringr)

# Dataframe
Data <- data.frame(Column1 = c(1.222, 3.445, 5.621, 8.501, 9.302), 
                  Column2 = c(654231, 12347, -2365, 90000, 12897), 
                  Column3 = c('A1', 'B2', 'E3', 'C1', 'F5'), 
                  Column4 = c('I bought it', 'The flower has a beautiful fragrance', 'It was bought by me', 'I have bought it', 'The flower smells good'), 
                  Column5 = c('Good', 'Bad', 'Ok', 'Moderate', 'Perfect'))

# Text column of interest
TextColumn <- Data$Column4

# Corpus
Content <-  corpus(TextColumn)

# Tokenization
Tokens <- tokens(Content, what = "word",
                 remove_punct = TRUE,
                 remove_symbols = TRUE,
                 remove_numbers = FALSE,
                 remove_url = TRUE,
                 remove_separators = TRUE,
                 split_hyphens = FALSE,
                 include_docvars = TRUE,
                 padding = FALSE)

Tokens <- tokens_tolower(Tokens)

# n-grams

quadgrams <- dfm(tokens_ngrams(Tokens, n = 4))
quadgrams_freq <- textstat_frequency(quadgrams)                  # quadgram frequency
quadgrs <- subset(quadgrams_freq,select=c(feature,frequency))
names(quadgrs) <- c("ngram","freq")
quadgrs <- as.data.table(quadgrs)

The result is:

[Image: resulting table with columns ngram and freq]

Is there a way to also extract the row number of Column4 from which each n-gram was built? For example, the table above should have a column containing 2 (the row number) for "the_flower_has_a", 2 again for "flower_has_a_beautiful", and so on.


Solution

  • You can pass a groups argument to textstat_frequency(). If you group by document name, and the document names are set to the row numbers, the resulting group column gives you a reference back to the original "row number".

    library("quanteda")
    ## Package version: 2.1.2
    
    library("data.table")
    
    # Dataframe
    Data <- data.frame(
      Column1 = c(1.222, 3.445, 5.621, 8.501, 9.302),
      Column2 = c(654231, 12347, -2365, 90000, 12897),
      Column3 = c("A1", "B2", "E3", "C1", "F5"),
      Column4 = c("I bought it", "The flower has a beautiful fragrance", "It was bought by me", "I have bought it", "The flower smells good"),
      Column5 = c("Good", "Bad", "Ok", "Moderate", "Perfect")
    )
    
    # Corpus
    Content <- corpus(Data, text_field = "Column4")
    docnames(Content) <- seq_len(nrow(Data))
    
    # Tokenization and ngrams
    Tokens <- tokens(Content,
      what = "word",
      remove_punct = TRUE,
      remove_symbols = TRUE,
      remove_url = TRUE
    ) %>%
      tokens_tolower() %>%
      tokens_ngrams(n = 4)
    

    Now comes the groups part:

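    # form the result, grouping the counts by document name (= row number)
    # (in quanteda >= 3.0, textstat_frequency() is provided by the quanteda.textstats package)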
    quadgrs <- textstat_frequency(dfm(Tokens), groups = docnames(Tokens)) %>%
      as.data.table()
    setnames(quadgrs, "group", "rownumber")
    
    quadgrs[, c("feature", "frequency", "rownumber")]
    ##                      feature frequency rownumber
    ## 1:          the_flower_has_a         1         2
    ## 2:    flower_has_a_beautiful         1         2
    ## 3: has_a_beautiful_fragrance         1         2
    ## 4:          it_was_bought_by         1         3
    ## 5:          was_bought_by_me         1         3
    ## 6:          i_have_bought_it         1         4
    ## 7:    the_flower_smells_good         1         5
    

    Note that:

    1. I simplified your code a bit, since some of it was unnecessary or could be streamlined.
    2. The frequency counts are now per row (document): an n-gram that occurs in several rows appears once per row in the output table, with its frequency within that row. If you would rather repeat the overall frequency for an n-gram that occurs in multiple rows, this code could easily be modified to do that. (Let me know if you want that.)
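
One possible way to do the modification mentioned in note 2 is sketched below with data.table alone. This is a sketch under assumptions, not the only approach: the table is rebuilt by hand to mirror the output above, the last row is hypothetical (it pretends "i_have_bought_it" also occurred twice in a sixth row, so that there is a repeated n-gram to aggregate), and the names total_frequency, rownumbers and by_ngram are made up for illustration.

```r
library(data.table)

# Per-row quadgram counts, shaped like the `quadgrs` table above.
# rownumber is stored as character here (an assumption about the group column).
# The last row is hypothetical and only exists to create a repeated n-gram.
quadgrs <- data.table(
  feature = c(
    "the_flower_has_a", "flower_has_a_beautiful", "has_a_beautiful_fragrance",
    "it_was_bought_by", "was_bought_by_me", "i_have_bought_it",
    "the_flower_smells_good", "i_have_bought_it"
  ),
  frequency = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L),
  rownumber = c("2", "2", "2", "3", "3", "4", "5", "6")
)

# Add the overall frequency of each n-gram, repeated on every row where it occurs
quadgrs[, total_frequency := sum(frequency), by = feature]

# Alternatively, collapse to one row per n-gram and list the rows it came from
by_ngram <- quadgrs[
  ,
  .(
    total_frequency = sum(frequency),
    rownumbers = paste(rownumber, collapse = ",")
  ),
  by = feature
]

print(quadgrs)
print(by_ngram)
```

The first form keeps one line per (n-gram, row) pair, which matches the table asked for in the question. The second trades that for one line per n-gram, which may be easier to read when the same n-gram appears in many rows.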