
How to keep the beginning and end of sentence markers with quanteda


I'm trying to create 3-grams using R's quanteda package.

I'm struggling to find a way to keep the beginning-of-sentence and end-of-sentence markers, <s> and </s>, in the n-grams, as in the code below.

I thought that using the keptFeatures argument with a regular expression that matches those markers would preserve them, but the chevron markers are always removed.

How can I keep the chevron markers from being removed, or what is the best way to delimit the beginning and end of a sentence with quanteda?

As a bonus question: what is the advantage of docfreq(mydfm) over colSums(mydfm)? The results of str(colSums(mydfm)) and str(docfreq(mydfm)) are almost identical (Named num [1:n] for the former, Named int [1:n] for the latter).

library(quanteda)
text <- "<s>I'm a sentence and I'd better be formatted properly!</s><s>I'm a second sentence</s>"

qc <- corpus(text)

mydfm <- dfm(qc, ngram = 3, removeNumbers = FALSE, stem = TRUE, keptFeatures = "\\</?s\\>")

names(colSums(mydfm))

# Output:
# [1] "s_i'm_a"    "i'm_a_sentenc"    "a_sentenc_and"    "sentenc_and_i'd"
# [2] "and_i'd_better"   "i'd_better_be"    "better_be_format"   
# [3] "be_format_proper" "format_proper_s"  "proper_s_s"   "s_s_i'm"    
# [4] "i'm_a_second"   "a_second_sentenc"   "second_sentenc_s"

EDIT:

Corrected keepFeatures to keptFeatures in code snippet.


Solution

  • To return a simple vector, just unlist the "tokenizedText" object returned from tokenize() (which is a specially classed list, with additional attributes). Here I used what = "fasterword", which splits on "\\s"; it's a tiny bit smarter than what = "fastestword", which splits on " ".

    # how to not remove the <s>, and return a vector 
    unlist(toks <- tokenize(text, ngrams = 3, what = "fasterword"))
    ## [1] "<s>I'm_a_sentence"                "a_sentence_and"                  
    ## [3] "sentence_and_I'd"                 "and_I'd_better"                  
    ## [5] "I'd_better_be"                    "better_be_formatted"             
    ## [7] "be_formatted_properly!</s><s>I'm" "formatted_properly!</s><s>I'm_a" 
    ## [9] "properly!</s><s>I'm_a_second"     "a_second_sentence</s>" 
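
    The difference between the two split patterns mentioned above can be illustrated with base R's strsplit(). This is only a sketch of what splitting on "\\s" versus " " means; it is not a claim about how quanteda implements these options internally, and the variable names are made up for the illustration.

    ```r
    # A string containing a tab and a newline as well as a plain space.
    x <- "one two\tthree\nfour"

    # Splitting on the regular expression "\\s" breaks at any whitespace character.
    split_any_ws <- strsplit(x, "\\s")[[1]]

    # Splitting on a literal space leaves the tab and newline inside the tokens.
    split_space <- strsplit(x, " ", fixed = TRUE)[[1]]

    print(split_any_ws)
    print(split_space)
    ```

    With the first pattern the string breaks into four tokens; with the second it breaks into only two, because the tab and the newline are not spaces.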
    

    To keep the n-grams within sentence boundaries, tokenise the object twice: the first time by sentence, the second time by fasterword.

    # keep it within sentence
    (sents <- unlist(tokenize(text, what = "sentence")))
    ## [1] "<s>I'm a sentence and I'd better be formatted properly!"
    ## [2] "</s><s>I'm a second sentence</s>" 
    tokenize(sents, ngrams = 3, what = "fasterword")
    ## tokenizedText object from 2 documents.
    ## Component 1 :
    ## [1] "<s>I'm_a_sentence"      "a_sentence_and"         "sentence_and_I'd"       "and_I'd_better"        
    ## [5] "I'd_better_be"          "better_be_formatted"    "be_formatted_properly!"
    ## 
    ## Component 2 :
    ## [1] "</s><s>I'm_a_second"   "a_second_sentence</s>"
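
    In the output above the markers stay glued to the neighbouring words (for example "<s>I'm"), because splitting on whitespace never separates them. One possible workaround, sketched here in base R rather than with quanteda functions, is to pad each marker with spaces before splitting, so that <s> and </s> become tokens of their own, and then to build the trigrams sentence by sentence. The helper make_ngrams() is hypothetical and written only for this illustration.

    ```r
    text <- "<s>I'm a sentence and I'd better be formatted properly!</s><s>I'm a second sentence</s>"

    # Pad every <s> and </s> with spaces so that each marker is a separate token.
    padded <- gsub("(</?s>)", " \\1 ", text)
    toks <- strsplit(trimws(padded), "\\s+")[[1]]

    # Hypothetical helper: join each run of n consecutive tokens with sep.
    make_ngrams <- function(tokens, n = 3, sep = "_") {
      if (length(tokens) < n) return(character(0))
      starts <- seq_len(length(tokens) - n + 1)
      vapply(starts,
             function(i) paste(tokens[i:(i + n - 1)], collapse = sep),
             character(1))
    }

    # Start a new sentence at every <s>, so no trigram crosses a sentence boundary.
    sentence_id <- cumsum(toks == "<s>")
    trigrams <- lapply(split(toks, sentence_id), make_ngrams, n = 3)

    print(trigrams)
    ```

    Each sentence then begins with the trigram "<s>_I'm_a" and ends with one that contains "</s>". Grouping on <s> also avoids the stray "</s>" that the sentence tokeniser left at the start of the second sentence.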
    

    To preserve the chevron markers in a dfm, you can pass through the same options that you used above in the tokenize() call, since dfm() calls tokenize() but with different defaults -- it uses the ones most users will probably want, whereas tokenize() is much more conservative.

    # Bonus questions:
    myDfm <- dfm(text, verbose = FALSE, what = "fasterword", removePunct = FALSE)
    # "chevron" markers are not removed
    features(myDfm)
    ## [1] "<s>i'm"              "a"                   "sentence"            "and"                 "i'd"                
    ## [6] "better"              "be"                  "formatted"           "properly!</s><s>i'm" "second"             
    ## [11] "sentence</s>" 
    

    The final part of the bonus question concerns the difference between docfreq() and colSums(). The former returns the number of documents in which a term occurs; the latter sums the columns to get the total term frequency across documents. See below how different these are for the term "representatives".

    # Difference between docfreq() and colSums():
    myDfm2 <- dfm(inaugTexts[1:4], verbose = FALSE)
    myDfm2[, "representatives"]
    ## Document-feature matrix of: 4 documents, 1 feature.
    ## 4 x 1 sparse Matrix of class "dfmSparse"
    ##                  features
    ## docs              representatives
    ##   1789-Washington               2
    ##   1793-Washington               0
    ##   1797-Adams                    2
    ##   1801-Jefferson                0
    docfreq(myDfm2)["representatives"]
    ## representatives 
    ##               2 
    colSums(myDfm2)["representatives"]
    ## representatives 
    ##               4 
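
    The same distinction can be reproduced without quanteda on a plain matrix. The sketch below copies the four counts for "representatives" from the output above into a base R matrix (the object names are made up for the illustration) and treats document frequency as the number of rows with a non-zero count, which matches the description of docfreq() given here.

    ```r
    # Toy document-feature counts copied from the output above:
    # rows are documents, the single column is the term "representatives".
    counts <- matrix(
      c(2, 0, 2, 0),
      ncol = 1,
      dimnames = list(
        docs = c("1789-Washington", "1793-Washington", "1797-Adams", "1801-Jefferson"),
        features = "representatives"
      )
    )

    # Total term frequency: add up the counts in each column.
    term_freq <- colSums(counts)

    # Document frequency: count the documents where the term occurs at least once.
    doc_freq <- colSums(counts > 0)

    print(term_freq)
    print(doc_freq)
    ```

    The term frequency is 4 and the document frequency is 2, the same values as in the quanteda output above.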
    

    Update: Some commands and behaviours have changed in quanteda v0.9.9:

    Return a simple vector, retaining chevrons:

    as.character(toks <- tokens(text, ngrams = 3, what = "fasterword"))
    #  [1] "<s>I'm_a_sentence"                "a_sentence_and"                   "sentence_and_I'd"                
    #  [4] "and_I'd_better"                   "I'd_better_be"                    "better_be_formatted"             
    #  [7] "be_formatted_properly!</s><s>I'm" "formatted_properly!</s><s>I'm_a"  "properly!</s><s>I'm_a_second"    
    # [10] "a_second_sentence</s>" 
    

    Keeping within sentence:

    (sents <- as.character(tokens(text, what = "sentence")))
    # [1] "<s>I'm a sentence and I'd better be formatted properly!" "</s><s>I'm a second sentence</s>"                       
    tokens(sents, ngrams = 3, what = "fasterword")
    # tokens from 2 documents.
    # Component 1 :
    # [1] "<s>I'm_a_sentence"      "a_sentence_and"         "sentence_and_I'd"       "and_I'd_better"         "I'd_better_be"         
    # [6] "better_be_formatted"    "be_formatted_properly!"
    # 
    # Component 2 :
    # [1] "</s><s>I'm_a_second"   "a_second_sentence</s>"
    

    Bonus question part 1:

    featnames(dfm(text, verbose = FALSE, what = "fasterword", removePunct = FALSE))
    #  [1] "<s>i'm"              "a"                   "sentence"            "and"                 "i'd"                
    #  [6] "better"              "be"                  "formatted"           "properly!</s><s>i'm" "second"             
    # [11] "sentence</s>"
    

    Bonus question part 2 is unchanged.