r quanteda

Using quanteda's tokens_compound to join multi-word expressions via underscore in a tokens object


I have a tokens object of words, with punctuation removed:

doc text
doc1 'Mohammed' 'Fisher' 'is' 'a' 'great' 'guy' 'He' 'loves' 'fishing'
doc2 'M' 'Fisher' 'likes' 'fishing' 'Fishing' 'yay'

I want to use tokens_compound on this to join certain multi-word expressions via underscore:

doc text
doc1 'Mohammed_Fisher' 'is' 'a' 'great' 'guy' 'He' 'loves' 'fishing'
doc2 'M_Fisher' 'likes' 'fishing' 'Fishing' 'yay'

So I defined a character vector of the multi-word expressions I want to join and called tokens_compound:

multiword <- c('Mohammed Fisher', 'M Fisher')
comp_toks <- tokens_compound(tokens, pattern = phrase(multiword))

This does not work, and neither does

comp_toks <- tokens_compound(tokens, pattern = as.phrase(multiword))

nor

comp_toks <- tokens_compound(tokens, multiword)

What am I missing here?


Solution

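  • Use phrase() rather than as.phrase(). phrase() splits each expression on whitespace into a sequence of tokens, which is the form tokens_compound matches multi-word patterns against: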

    > quanteda::phrase(c('Mohammed Fisher', 'M Fisher'))
    [[1]]
    [1] "Mohammed" "Fisher"  
    
    [[2]]
    [1] "M"      "Fisher"
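
For reference, here is a minimal reproducible sketch of the phrase() approach. It assumes the tokens object was built with tokens() from raw text; the input sentences are reconstructed from the tokens shown in the question, and the name toks is my own choice (used so the object does not shadow the tokens() function). Treat it as an illustration under those assumptions, not as the asker's exact setup.

```r
library(quanteda)

# Example texts reconstructed from the tokens in the question (an assumption).
txt <- c(doc1 = "Mohammed Fisher is a great guy. He loves fishing.",
         doc2 = "M Fisher likes fishing. Fishing, yay!")

# Tokenize into words and drop punctuation, as described in the question.
toks <- tokens(txt, remove_punct = TRUE)

# The multi-word expressions to join.
multiword <- c("Mohammed Fisher", "M Fisher")

# phrase() turns each expression into a sequence of single-token patterns;
# tokens_compound() then joins matching sequences with "_" (its default concatenator).
comp_toks <- tokens_compound(toks, pattern = phrase(multiword))

# Inspect the result as a plain list of character vectors.
as.list(comp_toks)
```

Note that tokens_compound() matches case-insensitively by default; pass case_insensitive = FALSE if the expressions should only match with the exact capitalization given.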