Tags: r, nlp, n-gram, quanteda, dfm

Creating a document-feature matrix takes very long in R


I am trying to create a document-feature matrix with character-level bigrams in R. The last line of my code never finishes, while each of the other lines takes less than a minute. I am not sure what to do. Any advice would be appreciated.

Code:

library(quanteda)
#Tokenise corpus by characters
character_level_tokens = quanteda::tokens(corpus, 
                                what = "character",
                                remove_punct = T,
                                remove_symbols = T,
                                remove_numbers = T,
                                remove_url = T,
                                remove_separators = T, 
                                split_hyphens = T)

#Convert tokens to characters
character_level_tokens = as.character(character_level_tokens)

#Keep A-Z, a-z letters
character_level_tokens = gsub("[^A-Za-z]","",character_level_tokens)

#Extract character-level bigrams
final_data_char_bigram = char_ngrams(character_level_tokens, n = 2L, concatenator = "")

#Create document-feature matrix (DFM)
dfm.final_data_char_bigram = dfm(final_data_char_bigram)


length(final_data_char_bigram)
[1] 37115571

head(final_data_char_bigram)
[1] "lo" "ov" "ve" "el" "ly" "yt"



Solution

  • I don't have your input corpus or a reproducible example, but here is how to get the result you want, and I'd be very surprised if it did not work just as well on your larger corpus. The first method uses token selection and n-gram construction in quanteda, while the second uses the character shingle tokenizer from the tokenizers package.

    library("quanteda")
    ## Package version: 2.0.1
    
    dfm.final_data_char_bigram <- data_corpus_inaugural %>%
      tokens(what = "character") %>%
      tokens_keep("[A-Za-z]", valuetype = "regex") %>%
      tokens_ngrams(n = 2, concatenator = "") %>%
      dfm()
    
    dfm.final_data_char_bigram
    ## Document-feature matrix of: 58 documents, 545 features (26.4% sparse) and 4 docvars.
    ##                  features
    ## docs              fe el ll lo ow wc ci  it  ti iz
    ##   1789-Washington 20 31 34 12 15  3 29  85 118  5
    ##   1793-Washington  1  1  7  1  4  1  2   8  12  1
    ##   1797-Adams      24 52 44 25 24  3 23 160 214  7
    ##   1801-Jefferson  34 49 60 35 31  7 34  91 116  8
    ##   1805-Jefferson  26 57 64 27 37  8 34 130 163 11
    ##   1809-Madison    11 29 37 15 17  1 21  62  82  3
    ## [ reached max_ndoc ... 52 more documents, reached max_nfeat ... 535 more features ]
    
    
    # another way
    
    dfm.final_data_char_bigram2 <- data_corpus_inaugural %>%
      tokenizers::tokenize_character_shingles(n = 2) %>%
      as.tokens() %>%
      dfm()
    
    dfm.final_data_char_bigram2
    ## Document-feature matrix of: 58 documents, 701 features (41.9% sparse).
    ##                  features
    ## docs              fe el ll lo ow wc ci  it  ti iz
    ##   1789-Washington 20 31 34 12 15  3 29  85 118  5
    ##   1793-Washington  1  1  7  1  4  1  2   8  12  1
    ##   1797-Adams      24 52 44 25 24  3 23 160 214  7
    ##   1801-Jefferson  34 49 60 35 31  7 34  91 116  8
    ##   1805-Jefferson  26 57 64 27 37  8 34 130 163 11
    ##   1809-Madison    11 29 37 15 17  1 21  62  82  3
    ## [ reached max_ndoc ... 52 more documents, reached max_nfeat ... 691 more features ]
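
A likely reason the code in the question stalls, sketched here on a toy corpus: `as.character()` flattens a tokens object into one long character vector, so the document boundaries are gone before `dfm()` is called. As far as I can tell, quanteda 2.x then treats every element of that vector as its own document, which for the question's data would mean roughly 37 million one-bigram documents. The example below is a minimal sketch, not a benchmark; the two-document `toy_corpus` and all variable names are made up for illustration, and it assumes a quanteda version where `tokens(what = "character")` is available.

```r
library("quanteda")

# Hypothetical two-document corpus standing in for the real data
toy_corpus <- corpus(c(doc1 = "Lovely tea!", doc2 = "Tea at 4pm."))

# Character-level tokens: still one element per document
char_toks <- tokens(toy_corpus, what = "character")
ndoc(char_toks)

# as.character() returns a single flat vector of characters,
# so the information about which document each one came from is lost
flat <- as.character(char_toks)
length(flat)

# Staying inside the tokens object keeps the documents intact
letter_toks <- tokens_keep(char_toks, "[A-Za-z]", valuetype = "regex")
bigram_toks <- tokens_ngrams(letter_toks, n = 2, concatenator = "")
toy_dfm <- dfm(bigram_toks)

# One row per original document, one column per distinct bigram
ndoc(toy_dfm)
featnames(toy_dfm)
```

Comparing `length(flat)` with `ndoc(toy_dfm)` shows the difference in scale: the flat vector grows with the number of characters, while the document-feature matrix built from the tokens object keeps one row per input document.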