r, nlp, sparse-matrix, quanteda

Join the top ten 1-grams of a quanteda dfm with all of the 2- through 5-grams


To conserve memory when dealing with a very large corpus sample, I'm looking to take just the top 10 1-grams and combine those with all of the 2- through 5-grams to form a single quanteda::dfmSparse object that will be used in natural language processing (NLP) predictions. Carrying around all of the 1-grams is pointless, because only the top ten (or twenty) will ever get used by the simple back-off model I'm using.

I wasn't able to find a quanteda::dfm(corpusText, ...) parameter that tells it to return only the top N features. So, based on comments from package author @KenB in other threads, I'm using the dfm_select/dfm_remove functions to extract the top ten 1-grams. Then, based on the "quanteda dfm join" search hit "concatenate dfm matrices in 'quanteda' package", I'm using what I believe is the rbind.dfmSparse function to join the results.

So far everything looks right from what I can tell. I thought I'd bounce this plan off the SO community to see whether I'm overlooking a more efficient route to this result, or some flaw in the solution I've arrived at so far.

library(quanteda)
library(data.table)

corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n",
    "and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n",
    "adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n"))
corpusObject$documents
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = T, stem = F, ngrams = 1))
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = T, stem = F, ngrams = 2:5)
dfm1gramsSorted; dfm2to5grams 
#featnames(dfm1gramsSorted); featnames(dfm2to5grams)
#colSums(dfm1gramsSorted); colSums(dfm2to5grams)

dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted))
# option1 - select top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10]) 
dfmTopTen1grams; featnames(dfmTopTen1grams)
# option2 - drop all but top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen]) 
dfmTopTen1grams; featnames(dfmTopTen1grams)

dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams)
dfmTopTen1gramsAndAll2to5grams;
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams)
data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50], frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50],
keep.rownames = F, stringsAsFactors = F)



Solution

  • For extracting the top 10 unigrams, this strategy will work just fine:

    1. Sort the dfm in the (default) decreasing order of overall feature frequency, which you have already done, then slice out the first 10 columns.

    2. Combine this with the 2- to 5-gram dfm using cbind() (not rbind()), because in a dfm the features are columns and the documents are rows.

    That should do it:

    dfmCombined <- cbind(dfm1gramsSorted[, 1:10], dfm2to5grams)
    head(dfmCombined, nfeat = 15)
    # Document-feature matrix of: 1 document, 195 features (0% sparse).
    # (showing first document and first 15 features)
    #        features
    # docs    some corpus text of to very large top ten no some_corpus corpus_text text_of of_no no_consequence
    #   text1    2      2    2  2  2    2     2   2   2  1           2           2       1     1              1
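
As a variation on the same idea, the top features can also be picked by name rather than by column position. The following is only a sketch, assuming a recent quanteda release in which n-grams are built with tokens() and tokens_ngrams() instead of the ngrams argument to dfm() used in the question; the sample text and the object names are made up for illustration.

```r
library(quanteda)

# Illustrative text only; in practice this would be the large corpus sample
txt <- "some corpus text of no consequence that in practice is going to be very large"
toks <- tokens(txt)

# Unigram dfm and a separate dfm holding all 2- to 5-grams
dfm1 <- dfm(toks)
dfm2to5 <- dfm(tokens_ngrams(toks, n = 2:5))

# Names of the ten most frequent unigrams
top10 <- names(topfeatures(dfm1, n = 10))

# Keep only those features, matching the names literally (not as glob patterns)
dfmTop10 <- dfm_select(dfm1, pattern = top10, valuetype = "fixed")

# Features are columns, so the two dfms are joined column-wise
dfmCombined <- cbind(dfmTop10, dfm2to5)
nfeat(dfmCombined) == 10 + nfeat(dfm2to5)
```

Selecting by name with valuetype = "fixed" avoids surprises if a feature name happens to contain glob characters such as * or ?.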
    

    Your example code uses data.table, although data.table is not mentioned in the text of the question. In v0.99 we added a new function, textstat_frequency(), which produces a "long"/"tidy" format of frequencies in a data.frame and might be helpful:

    head(textstat_frequency(dfmCombined), 10)
    #        feature frequency rank docfreq
    # 1         some         2    1       1
    # 2       corpus         2    2       1
    # 3         text         2    3       1
    # 4           of         2    4       1
    # 5           to         2    5       1
    # 6         very         2    6       1
    # 7        large         2    7       1
    # 8          top         2    8       1
    # 9          ten         2    9       1
    # 10 some_corpus         2   10       1
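
If textstat_frequency() is not available in your installation (in later releases it appears to have moved to the companion quanteda.textstats package), a similar long-format table can be assembled by hand. This is a rough sketch rather than a replacement: it computes only feature, frequency and a positional rank, and the text and object names are illustrative assumptions.

```r
library(quanteda)

# Illustrative text only
toks <- tokens("some corpus text and some more corpus text")
x <- dfm(toks)

# Total count of each feature across all documents
freq <- Matrix::colSums(x)

freqTable <- data.frame(feature = names(freq),
                        frequency = unname(freq),
                        stringsAsFactors = FALSE)

# Most frequent features first; rank is simply the row position,
# so tied features do not share a rank
freqTable <- freqTable[order(-freqTable$frequency), ]
freqTable$rank <- seq_len(nrow(freqTable))
rownames(freqTable) <- NULL
freqTable
```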