I'm dealing with a text file that has words like "c#", "c++", and ".net" from job ads. When I convert it into tokens, the "#", the "++", and the dot are removed. How can I keep them in the resulting tokens? Here is my code:
unnest_tokens(word, REQUIREMENTS, token = "words", to_lower = TRUE)
The problem is the argument token = "words", which splits on non-word characters (presumably using the regex \\W+). This function throws away the delimiters, so in order to keep those characters, you will have to pass something other than "words" as the token argument. You might want to define your own splitting regex with token = "regex" and something like this:
unnest_tokens(word,
              REQUIREMENTS,
              token = "regex",
              to_lower = TRUE,
              pattern = "\\s+") # split on whitespace rather than non-word elements
This way, you can define whatever regex you need to customize how the text is tokenized.
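One thing to watch for: splitting only on whitespace leaves any punctuation attached to a word, so "C#," would come out as "c#,". Here is a minimal sketch of one way around that, assuming your separators are whitespace, commas, and semicolons. The jobs data frame and its contents are made up for illustration; only the REQUIREMENTS column name comes from the question.

```r
library(tibble)
library(tidytext)

# Hypothetical sample data; only the column name REQUIREMENTS is from the question.
jobs <- tibble(REQUIREMENTS = c("Experience with C#, C++ and .NET",
                                "Knowledge of SQL or R"))

# Split on runs of whitespace, commas, or semicolons (an assumed separator set),
# so "#", "++" and a leading dot stay attached to their tokens.
tokens <- unnest_tokens(jobs,
                        word,
                        REQUIREMENTS,
                        token = "regex",
                        to_lower = TRUE,
                        pattern = "[\\s,;]+")

print(tokens$word)
```

A sentence-final period would still stick to the last word (for example "c#."), so you may need to add more characters to the pattern depending on how your ads are written. Just don't add a bare dot to the separator class, or ".net" will lose its dot again.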