I have a `tokens` object in words, without punctuation:
doc | text |
---|---|
doc1 | 'Mohammed' 'Fisher' 'is' 'a' 'great' 'guy' 'He' 'loves' 'fishing' |
doc2 | 'M' 'Fisher' 'likes' 'fishing' 'Fishing' 'yay' |
I want to use `tokens_compound()` on this to join certain multi-word expressions with an underscore:
doc | text |
---|---|
doc1 | 'Mohammed_Fisher' 'is' 'a' 'great' 'guy' 'He' 'loves' 'fishing' |
doc2 | 'M_Fisher' 'likes' 'fishing' 'Fishing' 'yay' |
Therefore, I defined a list of the multi-word expressions I want to join and used `tokens_compound()`:
multiword <- c('Mohammed Fisher', 'M Fisher')
comp_toks <- tokens_compound(tokens, pattern = phrase(multiword))
This does not work, neither does
comp_toks <- tokens_compound(tokens, pattern = as.phrase(multiword))
nor
comp_toks <- tokens_compound(tokens, multiword)
What am I missing here?
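Use `phrase()` instead of `as.phrase()`. `phrase()` splits each string on whitespace, so every multi-word expression becomes a sequence of separate tokens that can be matched against the tokens object: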
> quanteda::phrase(c('Mohammed Fisher', 'M Fisher'))
[[1]]
[1] "Mohammed" "Fisher"
[[2]]
[1] "M" "Fisher"
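
For reference, here is a minimal reproducible sketch of the whole workflow. It makes a few assumptions that are not in the question: the tokens object is rebuilt by hand with `as.tokens()` from the two example documents, and it is named `toks` rather than `tokens` so that it does not share a name with quanteda's `tokens()` function. The default concatenator of `tokens_compound()` is the underscore, so it is passed explicitly only for clarity.

```r
library(quanteda)

# Assumption: rebuild the example tokens object by hand from the question's table
toks <- as.tokens(list(
  doc1 = c("Mohammed", "Fisher", "is", "a", "great", "guy", "He", "loves", "fishing"),
  doc2 = c("M", "Fisher", "likes", "fishing", "Fishing", "yay")
))

multiword <- c("Mohammed Fisher", "M Fisher")

# phrase() turns each string into a whitespace-separated token sequence
comp_toks <- tokens_compound(toks, pattern = phrase(multiword), concatenator = "_")

# Inspect the result as a plain named list of character vectors
print(as.list(comp_toks))
```

If this works on the toy data but not on your own object, comparing `as.list(toks)` with the patterns in `multiword` is a reasonable next step, since the individual tokens must match the words in each expression.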