I looked into the documentation, but as far as I understand, there is no way to use the textstat_simil
function with a dictionary or globs. What would be the best way of approaching something like the below?
txt <- "It is raining. It rains a lot during the rainy season"
rain_dfm <- dfm(txt)
textstat_simil(rain_dfm, "rain", method = "cosine", margin = "features")
Do I need to use tokens_replace to change "rain*" to "rain", or is there another way to do this? In this case, stemming would do the trick, but what about cases where that is not feasible?
It's possible, but first you would need to convert the glob matches with "rain*" into "rain" by using dfm_lookup(). (Note: there are other ways to do this, such as tokenizing and then using tokens_lookup() or tokens_replace(), but I think the lookup approach is more straightforward, and it is also what you asked about in the question.)
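Also note that for feature similarity, you must have more than a single document: with only one document, each feature is represented by a single count, so the cosine similarity between any two features is trivially 1. This explains why I added two more documents here.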
txt <- c("It is raining. It rains a lot during the rainy season",
         "Raining today, and it rained yesterday.",
         "When it's raining it must be rainy season.")
rain_dfm <- dfm(txt)
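Then use a dictionary to map every feature matching the pattern "rain*" (glob matching is the default) to "rain", while keeping the other features. (In this particular case, you are correct that dfm_wordstem() could have accomplished much the same thing, although a stemmer is not guaranteed to reduce every form, such as "rainy", to "rain".)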
rain_dfm <- dfm_lookup(rain_dfm,
                       dictionary(list(rain = "rain*")),
                       exclusive = FALSE,
                       capkeys = FALSE)
rain_dfm
## Document-feature matrix of: 3 documents, 17 features (52.9% sparse).
## 3 x 17 sparse Matrix of class "dfm"
##        features
## docs    it is rain . a lot during the season today , and yesterday when it's must be
##   text1  2  1    3 1 1   1      1   1      1     0 0   0         0    0    0    0  0
##   text2  1  0    2 1 0   0      0   0      0     1 1   1         1    0    0    0  0
##   text3  1  0    2 1 0   0      0   0      1     0 0   0         0    1    1    1  1
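For comparison, the token-level route mentioned in the note above can be sketched roughly as follows. This is only a sketch, assuming a quanteda version in which tokens() and tokens_lookup() are available; the object name rain_dfm_tok is made up for illustration, and the resulting feature order may differ from the dfm printed above.

```r
library(quanteda)

txt <- c("It is raining. It rains a lot during the rainy season",
         "Raining today, and it rained yesterday.",
         "When it's raining it must be rainy season.")

# Tokenize first, then map every token matching the glob "rain*" to "rain".
toks <- tokens(txt)
toks <- tokens_lookup(toks,
                      dictionary(list(rain = "rain*")),
                      exclusive = FALSE,  # keep tokens that match no dictionary key
                      capkeys = FALSE)    # keep the key as "rain", not "RAIN"

# Build the dfm from the already-collapsed tokens (name is illustrative only).
rain_dfm_tok <- dfm(toks)
featnames(rain_dfm_tok)
```

The only difference from the dfm_lookup() approach is the stage at which the replacement happens; which one is more convenient depends on whether you still need the tokens for other steps.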
And now, you can compute the cosine similarity for the target feature of "rain":
textstat_simil(rain_dfm, selection = "rain", method = "cosine", margin = "features")
## rain
## it 0.9901475
## is 0.7276069
## rain 1.0000000
## . 0.9801961
## a 0.7276069
## lot 0.7276069
## during 0.7276069
## the 0.7276069
## season 0.8574929
## today 0.4850713
## , 0.4850713
## and 0.4850713
## yesterday 0.4850713
## when 0.4850713
## it's 0.4850713
## must 0.4850713
## be 0.4850713
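To see where these numbers come from, here is a small base R check that needs no packages. It copies a few feature columns by hand from the dfm printed earlier and applies the usual cosine formula; the helper name cosine is my own and is not part of quanteda.

```r
# Counts of each feature in text1, text2, text3, copied from the dfm above.
rain   <- c(3, 2, 2)
it     <- c(2, 1, 1)
season <- c(1, 0, 1)
today  <- c(0, 1, 0)

# Cosine similarity: dot product divided by the product of the Euclidean norms.
# (Hypothetical helper for illustration, not a quanteda function.)
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sims <- c(it     = cosine(rain, it),
          season = cosine(rain, season),
          today  = cosine(rain, today))
round(sims, 7)
```

The rounded values should agree with the "it", "season", and "today" rows above. Features that occur once in a single document where "rain" occurs twice, such as "today", all share the same similarity, which is why 0.4850713 repeats so often in the output.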