Using R, I need to extract bigram tokens (ngram = 2) from a file.
The problem is that the lines get combined, so some tokens have one word from the end of a line and the other word from the start of the next line.
Req_tok <- jobs %>% unnest_tokens(ngram, POSITION, token = "ngrams", n = 2)
In the file jobs, the first two lines are:
it architect
it helpdesk support agents
I get tokens like:
it architect
architect it
it helpdesk
and so on.
What should I do to avoid tokens like "architect it"?
I want to tokenize every line separately.
Just add collapse = FALSE to your unnest_tokens call, so that the rows are not pasted together before tokenizing:
library(tidytext)
library(dplyr)
jobs %>%
  unnest_tokens(ngram, POSITION, token = "ngrams", n = 2, collapse = FALSE)
Result:
ngram
1 it architect
2 it helpdesk
2.1 helpdesk support
2.2 support agents
Remember to convert your text column to character if it is a factor variable, otherwise unnest_tokens will throw an error.
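As a minimal sketch of that conversion, assuming the text column is called POSITION as in the question (the data frame name jobs_f is only illustrative):

```r
# Illustrative data where POSITION is stored as a factor
jobs_f <- data.frame(POSITION = factor(c("it architect", "it helpdesk support agents")))

# Convert the factor column to character before passing it to unnest_tokens
jobs_f$POSITION <- as.character(jobs_f$POSITION)

# Check the column type
is.character(jobs_f$POSITION)
```
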
Data:
jobs = data.frame(POSITION = c("it architect", "it helpdesk support agents"), stringsAsFactors = FALSE)
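If your installed tidytext handles the collapse argument differently (I believe it changed in later releases, so check ?unnest_tokens for your version), the same per-line behaviour can be sketched in base R. This is only a rough alternative, not how tidytext does it internally: the helper line_bigrams is a made-up name, it splits on whitespace and lowercases, and it does not strip punctuation.

```r
# Hypothetical helper: build the bigrams of a single line of text
line_bigrams <- function(line) {
  # Lowercase, trim, and split on runs of whitespace
  words <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # A line with fewer than two words has no bigrams
  if (length(words) < 2) return(character(0))
  # Pair each word with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

jobs <- data.frame(POSITION = c("it architect", "it helpdesk support agents"),
                   stringsAsFactors = FALSE)

# Tokenize each line on its own, so no bigram spans two lines
bigrams <- unlist(lapply(jobs$POSITION, line_bigrams))
bigrams
```

Because each line is processed separately, "architect it" can never appear in the result.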