Tags: r, data-mining, tokenize

How to keep special symbols like "(" "," and "#" in tokens in R?


I'm working with a text file of job ads that contains words like "c#", "c++", and ".net". When I convert it into tokens, the "#", the "++", and the dot are removed. How can I keep them in the resulting tokens? Here is my code:

unnest_tokens(word, REQUIREMENTS, token = "words", to_lower = TRUE)

Solution

  • The problem is the argument token = "words": it splits the text at word boundaries and strips punctuation (roughly what splitting on the regex \\W+ would do), so "#", "++", and the leading dot are thrown away along with the other delimiters. To keep those characters, you have to use some tokenizer other than "words". One option is to define your own splitting regex with token = "regex", something like this:

    unnest_tokens(word,
                  REQUIREMENTS,
                  token = "regex",
                  to_lower = TRUE,
                  pattern = "\\s+") # split on whitespace rather than non-word elements
    

    This way, you can define whatever regex you need to customize how the text is tokenized.
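
As a rough end-to-end sketch of the approach above: the data frame `job_ads` and its sample sentence are made up for illustration, and only the REQUIREMENTS column name comes from the question. Splitting on whitespace alone would leave punctuation glued to tokens (for example "c#,"), so this variant also treats commas and parentheses as delimiters. That choice is an assumption about what is wanted; adjust the character class to keep or drop whichever symbols you need.

```r
library(tidytext)

# Hypothetical sample data; only the column name REQUIREMENTS is from the question.
job_ads <- data.frame(
  REQUIREMENTS = "Experience with C#, C++ and .NET (required)",
  stringsAsFactors = FALSE
)

# Split on runs of whitespace, commas, and parentheses.
# "#", "+", and "." are not in the character class, so they stay inside the tokens.
tokens <- unnest_tokens(job_ads,
                        word,
                        REQUIREMENTS,
                        token = "regex",
                        to_lower = TRUE,
                        pattern = "[\\s,()]+")

print(tokens$word)
```

If you would rather keep commas or parentheses attached to the tokens, remove them from the character class; with pattern = "\\s+" only whitespace separates tokens.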