Tags: r, parsing, tokenize

Take tokens from the same line in R


Using R, I need to extract bigram tokens (n-grams with n = 2) from a file.

The problem is that the tokenizer combines the lines, so some tokens have one part at the end of a line and the other part at the start of the next line.

Req_tok <- jobs %>% unnest_tokens(ngram, POSITION, token = "ngrams", n = 2)

In the file jobs, the first two lines are:

it architect

it helpdesk support agents

I get tokens like:

it architect
architect it
it helpdesk
and so on...

What should I do to avoid getting tokens like "architect it"?

I want to tokenize every line separately.


Solution

  • Just add collapse = FALSE to your unnest_tokens call:

    library(tidytext)
    library(dplyr)
    
    jobs %>% 
      unnest_tokens(ngram, POSITION, token = "ngrams", n = 2, collapse = FALSE)
    

    Result:

                   ngram
    1       it architect
    2        it helpdesk
    2.1 helpdesk support
    2.2   support agents
    

    Remember to convert the text column to character if it is a factor variable; otherwise unnest_tokens will throw an error.

    Data:

    jobs = data.frame(POSITION = c("it architect", "it helpdesk support agents"), stringsAsFactors = FALSE)
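
To make the two points above concrete, here is a minimal base R sketch of what tokenizing each line separately amounts to, together with the factor-to-character conversion. It is not how tidytext implements this, and `line_bigrams` is a hypothetical helper introduced only for illustration. The meaning of `collapse` has also changed between tidytext releases (newer versions document it as `NULL` or a character vector of grouping columns), so it is worth checking `?unnest_tokens` for the installed version.

```r
# Sample data from the answer, but with POSITION deliberately stored as a factor.
jobs <- data.frame(
  POSITION = c("it architect", "it helpdesk support agents"),
  stringsAsFactors = TRUE
)

# Convert the factor column to character before tokenizing.
jobs$POSITION <- as.character(jobs$POSITION)

# Hypothetical helper: bigrams for a single line only,
# so no bigram can span two different lines.
line_bigrams <- function(line) {
  words <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  # Pair each word with the word that follows it.
  paste(head(words, -1), tail(words, -1))
}

# Apply the helper to each row separately and flatten the result.
bigrams <- unlist(lapply(jobs$POSITION, line_bigrams))
print(bigrams)
```

With the sample data this should give the same four bigrams as the Result above, and "architect it" never appears because each row is split on its own.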