I am trying to create a document-feature matrix with character-level bigrams in R. Every line of my code finishes in under a minute except the last one, which runs forever and never completes. I am not sure what to do; any advice would be appreciated.
Code:
library(quanteda)
# Tokenise the corpus into individual characters
character_level_tokens <- quanteda::tokens(corpus,
  what = "character",
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  split_hyphens = TRUE)
# Flatten the tokens object to a plain character vector
character_level_tokens <- as.character(character_level_tokens)
# Keep only the letters A-Z and a-z
character_level_tokens <- gsub("[^A-Za-z]", "", character_level_tokens)
# Extract character-level bigrams
final_data_char_bigram <- char_ngrams(character_level_tokens, n = 2L, concatenator = "")
# Create the document-feature matrix (DFM) -- this is the line that never finishes
dfm.final_data_char_bigram <- dfm(final_data_char_bigram)
length(final_data_char_bigram)
[1] 37115571
head(final_data_char_bigram)
[1] "lo" "ov" "ve" "el" "ly" "yt"
I don't have your input corpus or a reproducible example, but the bottleneck is almost certainly the last line. Once you flatten the tokens with as.character(), the document boundaries are gone, and dfm() treats each element of a character vector as a separate document, so your 37,115,571 bigrams become a matrix with over 37 million documents. That is why it never finishes.
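A minimal sketch of that behaviour (the three-element vector here is just for illustration):

library("quanteda")
# each element of a character vector becomes its own "document"
ndoc(dfm(tokens(c("lo", "ov", "ve"))))
## [1] 3

The fix is to keep everything inside tokens objects so the document boundaries survive. Here are two ways to get the result you want, and I'd be very surprised if they did not work just fine on your larger corpus too. The first uses token selection and ngram construction in quanteda, while the second makes use of the character shingle tokenizer from the tokenizers package.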
library("quanteda")
## Package version: 2.0.1
dfm.final_data_char_bigram <- data_corpus_inaugural %>%
  tokens(what = "character") %>%
  tokens_keep("[A-Za-z]", valuetype = "regex") %>%
  tokens_ngrams(n = 2, concatenator = "") %>%
  dfm()
dfm.final_data_char_bigram
## Document-feature matrix of: 58 documents, 545 features (26.4% sparse) and 4 docvars.
##                   features
## docs              fe el ll lo ow wc ci  it  ti iz
##   1789-Washington 20 31 34 12 15  3 29  85 118  5
##   1793-Washington  1  1  7  1  4  1  2   8  12  1
##   1797-Adams      24 52 44 25 24  3 23 160 214  7
##   1801-Jefferson  34 49 60 35 31  7 34  91 116  8
##   1805-Jefferson  26 57 64 27 37  8 34 130 163 11
##   1809-Madison    11 29 37 15 17  1 21  62  82  3
## [ reached max_ndoc ... 52 more documents, reached max_nfeat ... 535 more features ]
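Applied to your own corpus object with the same cleaning options as your original call, the first approach would look something like this (a sketch, assuming corpus is a quanteda corpus; the name character_bigram_dfm is just illustrative):

character_bigram_dfm <- corpus %>%
  tokens(what = "character", remove_punct = TRUE, remove_symbols = TRUE,
         remove_numbers = TRUE, remove_url = TRUE, remove_separators = TRUE,
         split_hyphens = TRUE) %>%
  tokens_keep("[A-Za-z]", valuetype = "regex") %>%
  tokens_ngrams(n = 2, concatenator = "") %>%
  dfm()

Everything stays inside a tokens object until the final dfm() call, so the document boundaries are preserved throughout.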
# another way
dfm.final_data_char_bigram2 <- data_corpus_inaugural %>%
  tokenizers::tokenize_character_shingles(n = 2) %>%
  as.tokens() %>%
  dfm()
dfm.final_data_char_bigram2
## Document-feature matrix of: 58 documents, 701 features (41.9% sparse).
##                   features
## docs              fe el ll lo ow wc ci  it  ti iz
##   1789-Washington 20 31 34 12 15  3 29  85 118  5
##   1793-Washington  1  1  7  1  4  1  2   8  12  1
##   1797-Adams      24 52 44 25 24  3 23 160 214  7
##   1801-Jefferson  34 49 60 35 31  7 34  91 116  8
##   1805-Jefferson  26 57 64 27 37  8 34 130 163 11
##   1809-Madison    11 29 37 15 17  1 21  62  82  3
## [ reached max_ndoc ... 52 more documents, reached max_nfeat ... 691 more features ]
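The second method yields more features than the first (701 vs 545), most likely because tokenize_character_shingles keeps digits: its strip_non_alphanum argument removes only punctuation and whitespace. If you want letter-only bigrams as in the first method, you could filter the resulting dfm afterwards (a sketch; the shingles are lowercased by default, hence [a-z]):

dfm.final_data_char_bigram2 <- dfm_select(dfm.final_data_char_bigram2,
  "^[a-z]{2}$", valuetype = "regex")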