r, regex, quanteda

Look ahead and look behind not working for quanteda dictionary


I am trying to set up a quanteda dictionary which contains many overlapping terms. I believe using regex look ahead/look behind could be a way to solve this and avoid false hits, but I must be doing something wrong.

library("quanteda")

text <- c("guinea", "equatorial guinea", "guinea bissau")

# Attempt 1: plain match on "guinea"
dict <- dictionary(list(guinea = "guinea"))
dfm1 <- dfm(text, dictionary = dict, valuetype = "regex")
colSums(dfm1)

# Attempt 2: negative lookbehind, meant to skip "equatorial guinea"
dict2 <- dictionary(list(guinea = "(?<!equatorial[[:space:]])guinea"))
dfm2 <- dfm(text, dictionary = dict2, valuetype = "regex")
colSums(dfm2)

# Attempt 3: negative lookahead, meant to skip "guinea bissau"
dict3 <- dictionary(list(guinea = "guinea(?![[:space:]]bissau)"))
dfm3 <- dfm(text, dictionary = dict3, valuetype = "regex")
colSums(dfm3)

The expected results are:

# dfm1
colSums(dfm1)
guinea 
     3 
# dfm2
colSums(dfm2)
guinea 
     2
# dfm3 
colSums(dfm3)
guinea 
     2 

But the actual results are all 3. Is this an issue with the look ahead/look behind, or with how the blank space is specified?


Solution

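  • This sort of regex matching does not work because a dictionary pattern is matched against one token at a time and cannot span multiple tokens. The dfm(x, dictionary = ...) call tokenizes the text first and then calls tokens_lookup(), so by the time the pattern is applied, "equatorial guinea" has already been split into the tokens "equatorial" and "guinea", and a look behind or look ahead has no neighbouring text to inspect.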
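
    The effect of per-token matching can be reproduced without quanteda. The sketch below is only an illustration in base R: it uses grepl(..., perl = TRUE) and a crude whitespace split as stand-ins for quanteda's own regex engine and tokenizer, which are assumptions and not what the package does internally.

    ```r
    # Base R stand-in for a per-token dictionary lookup (illustration only,
    # not quanteda's actual code).
    text <- c("guinea", "equatorial guinea", "guinea bissau")
    pattern <- "(?<!equatorial[[:space:]])guinea"

    # Matched against each whole string, the look behind can see "equatorial ".
    whole_string <- grepl(pattern, text, perl = TRUE)
    print(whole_string)
    # [1]  TRUE FALSE  TRUE

    # Crude whitespace tokenizer (assumption: good enough for this example).
    tokens <- strsplit(text, "[[:space:]]+")

    # Matched token by token, "guinea" is always a standalone string,
    # so the look behind never finds "equatorial " in front of it.
    per_token <- vapply(
      tokens,
      function(tok) any(grepl(pattern, tok, perl = TRUE)),
      logical(1)
    )
    print(per_token)
    # [1] TRUE TRUE TRUE
    ```

    The second result mirrors the counts of 3 reported in the question: every text still contains a token that matches on its own.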

    There is a much easier way to do this, which is simply to include the multi-word values in your dictionary. So:

    library("quanteda")
    ## Package version: 1.4.3
    
    text <- c("guinea", "equatorial guinea", "guinea bissau")
    
    dict <- dictionary(list(guinea = "guinea"))
    dict2 <- dictionary(list(guinea = "equatorial guinea"))
    dict3 <- dictionary(list(guinea = "guinea bissau"))
    
    dfm(text, dictionary = dict)
    ## Document-feature matrix of: 3 documents, 1 feature (0.0% sparse).
    ## 3 x 1 sparse Matrix of class "dfm"
    ##        features
    ## docs    guinea
    ##   text1      1
    ##   text2      1
    ##   text3      1
    
    dfm(text, dictionary = dict2)
    ## Document-feature matrix of: 3 documents, 1 feature (66.7% sparse).
    ## 3 x 1 sparse Matrix of class "dfm"
    ##        features
    ## docs    guinea
    ##   text1      0
    ##   text2      1
    ##   text3      0
    
    dfm(text, dictionary = dict3)
    ## Document-feature matrix of: 3 documents, 1 feature (66.7% sparse).
    ## 3 x 1 sparse Matrix of class "dfm"
    ##        features
    ## docs    guinea
    ##   text1      0
    ##   text2      0
    ##   text3      1
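
    If the goal is the count from the question, "guinea" but not "equatorial guinea", one possible approach is to join the unwanted phrase into a single token before the lookup, so that the plain "guinea" entry no longer matches it. This is a sketch rather than part of the original answer; it assumes the default "_" concatenator of tokens_compound() and the default glob matching of tokens_lookup(), and the names toks, toks_joined and guinea_counts are made up for the example.

    ```r
    library("quanteda")

    text <- c("guinea", "equatorial guinea", "guinea bissau")
    toks <- tokens(text)

    # Join the phrase to exclude into one token, "equatorial_guinea"
    # (assumes the default concatenator "_").
    toks_joined <- tokens_compound(toks, pattern = phrase("equatorial guinea"))

    # With glob matching, "guinea" only matches a token that is exactly "guinea",
    # so the compounded token is skipped.
    dict <- dictionary(list(guinea = "guinea"))
    guinea_counts <- dfm(tokens_lookup(toks_joined, dictionary = dict))

    colSums(guinea_counts)
    ```

    Under these assumptions the total for guinea should be 2, which is the dfm2 result the question was after. Compounding "guinea bissau" instead would correspond to the dfm3 case.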