Tags: r, nlp, n-gram, quanteda

How do I keep intra-word periods in unigrams? R quanteda


I would like to preserve two-letter acronyms that contain intra-word periods, such as "t.v." and "u.s.", in my unigram frequency table. When I build the table with quanteda, the terminating period gets truncated. Here is a small test corpus to illustrate; I have removed periods as sentence separators:

SOS This is the u.s. where our politics is crazy EOS

SOS In the US we watch a lot of t.v. aka TV EOS

SOS TV is an important part of life in the US EOS

SOS folks outside the u.s. probably don't watch so much t.v. EOS

SOS living in other countries is probably not any less crazy EOS

SOS i enjoy my sanity when it comes to visit EOS

which I load into R as a character vector:

acro.test <- c("SOS This is the u.s. where our politics is crazy EOS", "SOS In the US we watch a lot of t.v. aka TV EOS", "SOS TV is an important part of life in the US EOS", "SOS folks outside the u.s. probably don't watch so much t.v. EOS", "SOS living in other countries is probably not any less crazy EOS", "SOS i enjoy my sanity when it comes to visit EOS")

Here is the code I use to build my unigram frequency table:

library(quanteda)
dat.dfm <- dfm(acro.test, ngrams=1, verbose=TRUE, concatenator=" ",  toLower=FALSE, removeNumbers=TRUE, removePunct=FALSE, stopwords=FALSE)
dat.mat <- as.data.frame(as.matrix(docfreq(dat.dfm)))
ng.sorted <- sort(rowSums(dat.mat), decreasing=TRUE)
freqTable <- data.frame(ngram=names(ng.sorted), frequency = ng.sorted)
row.names(freqTable) <- NULL
freqTable

This produces the following:

       ngram frequency
1        SOS         6
2        EOS         6
3        the         4
4         is         3
5          .         3
6        u.s         2
7      crazy         2
8         US         2
9      watch         2
10        of         2
11       t.v         2
12        TV         2
13        in         2
14  probably         2
15      This         1
16     where         1
17       our         1
18  politics         1
19        In         1
20        we         1
21         a         1
22       lot         1
23       aka         1

etc...

I would like to keep the terminal periods on t.v. and u.s., and to eliminate the entry for . (frequency 3) from the table.

I also don't understand why the period (.) has a count of 3 in this table while the u.s and t.v unigrams are counted correctly (2 each).


Solution

  • The reason for this behaviour is that quanteda's default word tokeniser uses the ICU-based definition for word boundaries (from the stringi package), so u.s. is tokenised as the word u.s followed by a separate period token. This is great if your name is will.i.am but maybe not so great for your purposes. You can easily switch to the white-space tokeniser by passing the argument what = "fasterword" to tokens(), an option available in dfm() through the ... part of the function call.

    tokens(acro.test, what = "fasterword")[[1]]
    ## [1] "SOS"      "This"     "is"       "the"      "u.s."     "where"    "our"      "politics" "is"       "crazy"    "EOS" 
    

    You can see that here, u.s. is preserved. In response to your last question, the terminal . had a document frequency of 3 because it appeared in three documents as a separate token, which is the default word tokeniser behaviour when remove_punct = FALSE.
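You can check the default behaviour directly by tokenising the same text both ways (a quick sketch reusing the acro.test vector defined above):

```r
library(quanteda)

# default ICU word tokeniser: the trailing period is split off as its own
# token, which is why "." shows up as a feature with a document frequency of 3
as.character(tokens(acro.test[1], what = "word"))

# white-space tokeniser: "u.s." is kept intact
as.character(tokens(acro.test[1], what = "fasterword"))
```

The first call yields u.s and . as separate tokens; the second preserves u.s. whole.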

    To pass this through to dfm() and then construct your data.frame of document frequencies, the following code works (I've tidied it up a bit for efficiency). Note the comment in the code about the difference between document and term frequency: docfreq() counts the number of documents a term appears in, not its total number of occurrences, which sometimes confuses users.

    # I removed the options that were the same as the default 
    # note also that stopwords is not a valid argument - see the remove argument
    dat.dfm <- dfm(acro.test, tolower = FALSE, remove_punct = FALSE, what = "fasterword")
    
    # sort in descending document frequency
    dat.dfm <- dat.dfm[, names(sort(docfreq(dat.dfm), decreasing = TRUE))]
    # Note: this would sort the dfm in descending total term frequency
    #       not the same as docfreq
    # dat.dfm <- sort(dat.dfm)
    
    # this creates the data.frame in one more efficient step
    freqTable <- data.frame(ngram = featnames(dat.dfm), frequency = docfreq(dat.dfm),
                            row.names = NULL, stringsAsFactors = FALSE)
    head(freqTable, 10)
    ##    ngram frequency
    ## 1    SOS         6
    ## 2    EOS         6
    ## 3    the         4
    ## 4     is         3
    ## 5   u.s.         2
    ## 6  crazy         2
    ## 7     US         2
    ## 8  watch         2
    ## 9     of         2
    ## 10  t.v.         2
    

    In my view the named vector produced by docfreq() on the dfm is a more efficient method for storing the results than your data.frame approach, but you may wish to add other variables.
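    For example, both document frequency and total term frequency can be kept as plain named vectors (a sketch reusing the dat.dfm object built above; colSums() on a dfm gives total occurrence counts):

```r
# document frequency: the number of documents each feature occurs in
df_vec <- sort(docfreq(dat.dfm), decreasing = TRUE)

# term frequency: total occurrences of each feature across all documents
tf_vec <- sort(colSums(dat.dfm), decreasing = TRUE)

df_vec["is"]  # 3: "is" occurs in three documents
tf_vec["is"]  # 4: twice in document 1, once each in documents 3 and 5
```

    Indexing a named vector by feature is often all you need; convert to a data.frame only when you want to attach further variables.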