
Using textstat_simil with a dictionary or globs in Quanteda


I looked into the documentation, but as far as I understand, there is no way to use the textstat_simil function with a dictionary or globs. What would be the best way of approaching something like the example below?

txt <- "It is raining. It rains a lot during the rainy season"
rain_dfm <- dfm(txt)
textstat_simil(rain_dfm, "rain", method = "cosine", margin = "features")

Do I need to use tokens_replace to change "rain*" to "rain", or is there another way to do this? In this case, stemming would do the trick, but what about cases where that is not feasible?


Solution

  • It's possible, but first you need to convert the glob matches for "rain*" into "rain" using dfm_lookup(). (Note: there are other ways to do this, such as tokenizing and then using tokens_lookup() or tokens_replace(), but I think the lookup approach is more straightforward, and it is also what you asked about in the question.)

    Also note that computing feature similarity requires more than a single document, which is why I added two more here.

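    library(quanteda)

    txt <- c("It is raining. It rains a lot during the rainy season",
             "Raining today, and it rained yesterday.",
             "When it's raining it must be rainy season.")
    
    # tokenize explicitly: newer quanteda versions expect tokens as input to dfm()
    rain_dfm <- dfm(tokens(txt))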
    

    Then use a dictionary to map every feature matching the pattern "rain*" (glob matching is the default) to "rain", while keeping the other features. (In this particular case, you are correct that dfm_wordstem() could have accomplished the same thing.)

    rain_dfm <- dfm_lookup(rain_dfm, 
                           dictionary(list(rain = "rain*")), 
                           exclusive = FALSE,
                           capkeys = FALSE)
    rain_dfm
    ## Document-feature matrix of: 3 documents, 17 features (52.9% sparse).
    ## 3 x 17 sparse Matrix of class "dfm"
    ##        features
    ## docs    it is rain . a lot during the season today , and yesterday when it's must be
    ##   text1  2  1    3 1 1   1      1   1      1     0 0   0         0    0    0    0  0
    ##   text2  1  0    2 1 0   0      0   0      0     1 1   1         1    0    0    0  0
    ##   text3  1  0    2 1 0   0      0   0      1     0 0   0         0    1    1    1  1
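
    The tokens-based route mentioned in the note above can be sketched roughly as follows. This is only an illustration, not part of the original answer: the object names toks and rain_dfm_alt are made up here, and the sketch assumes that tokens_lookup() accepts the same dictionary, exclusive, and capkeys arguments as dfm_lookup().

    ```r
    library(quanteda)

    txt <- c("It is raining. It rains a lot during the rainy season",
             "Raining today, and it rained yesterday.",
             "When it's raining it must be rainy season.")

    # Tokenize first, then replace every token matching the glob "rain*"
    # with the dictionary key "rain"; exclusive = FALSE keeps unmatched tokens.
    toks <- tokens(txt)
    toks <- tokens_lookup(toks,
                          dictionary(list(rain = "rain*")),
                          exclusive = FALSE,
                          capkeys = FALSE)

    # dfm() lowercases the remaining tokens by default
    rain_dfm_alt <- dfm(toks)
    featnames(rain_dfm_alt)
    ```

    The resulting dfm should carry the same "rain" counts per document as the dfm_lookup() version, so either object can be passed on to the similarity step.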
    

    Now you can compute the cosine similarity of every feature with the target feature "rain":

    textstat_simil(rain_dfm, selection = "rain", method = "cosine", margin = "features")
    ##                rain
    ## it        0.9901475
    ## is        0.7276069
    ## rain      1.0000000
    ## .         0.9801961
    ## a         0.7276069
    ## lot       0.7276069
    ## during    0.7276069
    ## the       0.7276069
    ## season    0.8574929
    ## today     0.4850713
    ## ,         0.4850713
    ## and       0.4850713
    ## yesterday 0.4850713
    ## when      0.4850713
    ## it's      0.4850713
    ## must      0.4850713
    ## be        0.4850713
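
    For readers on newer quanteda releases, here is a hedged sketch of the same computation. It assumes (based on my understanding of quanteda 3 and later, not on the original answer) that textstat_simil() is provided by the separate quanteda.textstats package and that the target feature is passed as a dfm through the y argument instead of selection. The name sim is made up for illustration.

    ```r
    library(quanteda)
    library(quanteda.textstats)  # assumption: textstat_simil() lives here in quanteda >= 3

    txt <- c("It is raining. It rains a lot during the rainy season",
             "Raining today, and it rained yesterday.",
             "When it's raining it must be rainy season.")

    # Build the dfm from tokens and collapse all "rain*" features into "rain"
    rain_dfm <- dfm(tokens(txt))
    rain_dfm <- dfm_lookup(rain_dfm,
                           dictionary(list(rain = "rain*")),
                           exclusive = FALSE,
                           capkeys = FALSE)

    # Compare every feature with the single target column "rain"
    sim <- textstat_simil(rain_dfm, y = rain_dfm[, "rain"],
                          method = "cosine", margin = "features")
    as.matrix(sim)

    # Hand check for "it": per-document counts are (2, 1, 1) for "it"
    # and (3, 2, 2) for "rain"
    it_counts   <- c(2, 1, 1)
    rain_counts <- c(3, 2, 2)
    sum(it_counts * rain_counts) /
      (sqrt(sum(it_counts^2)) * sqrt(sum(rain_counts^2)))
    ```

    The hand check works out to 10 / sqrt(6 * 17), roughly 0.990, which agrees with the value reported for "it" in the output above.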