
Quanteda: how to get ngrams, and their frequencies, given n-1 predecessor words/types

For next-word prediction using ngrams, I need to find all the ngrams (and their frequencies) that start with n-1 given predecessor words.
In dfm I could not see any way to do that, so I started implementing it manually on the textstat_frequency output (a data.frame).
After bumping into some methods whose documentation is not clear to me, I wonder whether there is a way and I am simply unable to see it (maybe one of the "[" methods that are listed but not described in a way I understand), hence this question.
(I am implicitly, and maybe wrongly, excluding regexes, which I normally love, out of a prejudice that running them on hundreds of thousands of strings might be too slow or heavy.)
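To make the task concrete, here is a minimal base-R sketch of the lookup being asked for: given a vector of words and a prefix of n-1 words, count the words that follow that prefix. It does not use quanteda at all; `next_word_counts` is a hypothetical helper name introduced only for illustration, and the approach (a plain scan over the token vector) is an assumption about what a manual implementation could look like, not the package's method.

```r
# Hypothetical helper (not part of quanteda): count the words that
# immediately follow a given (n-1)-word prefix in a token vector.
next_word_counts <- function(words, prefix) {
  k <- length(prefix)
  n_starts <- length(words) - k   # start positions where prefix + 1 word fits
  if (n_starts < 1) return(table(character(0)))
  # is_match[i] is TRUE when words[i .. i+k-1] equals the prefix
  is_match <- rep(TRUE, n_starts)
  for (j in seq_len(k)) {
    is_match <- is_match & (words[seq_len(n_starts) + j - 1] == prefix[j])
  }
  followers <- words[which(is_match) + k]
  sort(table(followers), decreasing = TRUE)
}

words <- strsplit("a b 1 2 3 a b 2 3 4 a b 3 4 5", " ", fixed = TRUE)[[1]]
next_word_counts(words, c("a", "b"))   # followers of the bigram "a b"
next_word_counts(words, c("3", "4"))   # followers of the bigram "3 4"
```

This scans the whole token vector once per query, so it is only meant to pin down the expected result; for many queries a precomputed table of (prefix, next word) counts would likely be a better fit.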

I looked into fcm() as suggested in a comment, but I am only able to get ngrams that follow ngrams, as in the code below. This is not what I asked, because it works only for n = 2 (and requires subsetting the resulting matrix to the given (n-1)-gram).

library(quanteda)
txt <- c("a b 1 2 3 a b 2 3 4 a b 3 4 5")
fcm(tokens(txt, ngrams = 2), "window", window = 1, ordered = TRUE)
Feature co-occurrence matrix of: 10 by 10 features.
10 x 10 sparse Matrix of class "fcm"
features a_b b_1 1_2 2_3 3_a b_2 3_4 4_a b_3 4_5
     a_b   0   1   0   0   0   1   0   0   1   0
     b_1   0   0   1   0   0   0   0   0   0   0
     1_2   0   0   0   1   0   0   0   0   0   0
     2_3   0   0   0   0   1   0   1   0   0   0
     3_a   1   0   0   0   0   0   0   0   0   0
     b_2   0   0   0   1   0   0   0   0   0   0
     3_4   0   0   0   0   0   0   0   1   0   1
     4_a   1   0   0   0   0   0   0   0   0   0
     b_3   0   0   0   0   0   0   1   0   0   0
     4_5   0   0   0   0   0   0   0   0   0   0 

The code above uses quanteda installed from GitHub on 20 Aug 2018, which should contain the fix prompted by this question. The installed version is:

[1] ‘1.3.5’


  • A package contributor kindly provided sample code that shows how to achieve what I asked, for text that is not too large. I reproduce that code here, with some simplifications and comments, to make it as easy to understand as possible.

    sample_code <- function() {
      print(paste("based on",""))
      print("great package great support, thanks")
      ngms <- tokens("a b 1 2 3 a b 2 3 4 a b 3 4 5", ngrams = 2:5)
      # get rid of tokens metadata not necessary for our use case
      ngms_lst <- as.list(ngms)
      ngms_unlst <- unlist(ngms_lst) # (named) character vector of "_"-separated ngrams
      # split in " "-separated pairs:  "n-1 tokens", "nth token"
      ngms_blank_sep <- stringi::stri_replace_last_fixed(ngms_unlst,"_", " ")
      # list of character(2)  ( (n-1)gram ,nth token )
      tk2_lst <- tokens(ngms_blank_sep)
      # --- end of tokens/ngrams pre-processing
      # ordinary fcm
      fcm_ord <- fcm(tk2_lst , ordered = TRUE)
      # show a few rows: (n-1)-grams in rows, the token that follows in columns
      fcm_ord[33:39, 1:6]
    }
    sample_code()
    [1] "based on"
    [1] "great package great support, thanks"
    Feature co-occurrence matrix of: 7 by 6 features.
    7 x 6 sparse Matrix of class "fcm"
    features  a b 1 2 3 4
      3_a_b_2 0 0 0 0 1 0
      a_b_2_3 0 0 0 0 0 1
      b_2_3_4 1 0 0 0 0 0
      2_3_4_a 0 1 0 0 0 0
      3_4_a_b 0 0 0 0 1 0
      4_a_b_3 0 0 0 0 0 1
      a_b_3_4 0 0 0 0 0 0
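
The matrix above holds the counts for every prefix at once; what I still needed was the lookup for one given (n-1)-gram. As a rough sketch of that last step, the same "split at the last separator" idea can be done in base R on a plain character vector of "_"-joined ngrams (like `ngms_unlst` above). The names `ngram_followers` and `counts` are hypothetical, introduced here for illustration only, and the sketch assumes every ngram contains at least one separator (n >= 2).

```r
# Hypothetical helper (not part of quanteda): build a table of
# (prefix, next_word, Freq) from "_"-joined n-grams with n >= 2.
ngram_followers <- function(ngrams, sep = "_") {
  # position of the last separator in each n-gram (fixed match, no regex)
  last_sep <- vapply(gregexpr(sep, ngrams, fixed = TRUE),
                     function(p) max(p), integer(1))
  prefix    <- substr(ngrams, 1, last_sep - 1)
  next_word <- substr(ngrams, last_sep + 1, nchar(ngrams))
  counts <- as.data.frame(table(prefix = prefix, next_word = next_word),
                          stringsAsFactors = FALSE)
  counts[counts$Freq > 0, ]   # drop prefix/word pairs that never occur
}

# Build the trigrams of the example text by hand, to stay independent of quanteda
words <- strsplit("a b 1 2 3 a b 2 3 4 a b 3 4 5", " ", fixed = TRUE)[[1]]
n <- length(words)
trigrams <- paste(words[1:(n - 2)], words[2:(n - 1)], words[3:n], sep = "_")

counts <- ngram_followers(trigrams)
counts[counts$prefix == "a_b", ]   # candidate next words after "a b"
```

Building the table once and filtering it per prefix avoids rescanning the text for each prediction; for a large corpus a keyed structure (for example a data.table keyed on the prefix) may be worth considering, though I have not measured that here.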