
Counting words in "lines" tokens


I'm completely new to R, so this question may seem obvious. However, I couldn't work it out myself and couldn't find a solution.

How can I count the number of words within my tokens when the tokens are lines (reviews, actually)? I have a dataset of reviews (reviewText), each linked to a product ID (asin).

    amazonr_tidy_sent <- amazonr_tidy_sent %>%
        unnest_tokens(word, reviewText, token = "lines")
    amazonr_tidy_sent <- amazonr_tidy_sent %>%
        anti_join(stop_words) %>%
        ungroup()

I tried it the following way:

    wordcounts <- amazonr_tidy_sent %>% group_by(word, asin) %>% summarize(word = n())

but that did not give me what I wanted. I assume there is no way to count the words, because a line used as a token cannot be "separated".

Thanks a lot


Solution

  • You can use unnest_tokens() more than once, if it is appropriate to your analysis.

    First, you can use unnest_tokens() to get the lines that you want. Notice that I am adding a column to keep track of the id of each line; you could call that whatever you want, but the important thing is to have a column that will note which line you are on.

    library(tidytext)
    library(dplyr)
    library(janeaustenr)
    
    
    # tibble() replaces the deprecated data_frame(); dplyr re-exports it
    d <- tibble(txt = prideprejudice)
    
    d_lines <- d %>%
        unnest_tokens(line, txt, token = "lines") %>%
        mutate(id = row_number())
    
    d_lines
    
    #> # A tibble: 10,721 × 2
    #>                                                                        line
    #>                                                                       <chr>
    #>  1                                                      pride and prejudice
    #>  2                                                           by jane austen
    #>  3                                                                chapter 1
    #>  4  it is a truth universally acknowledged, that a single man in possession
    #>  5                            of a good fortune, must be in want of a wife.
    #>  6   however little known the feelings or views of such a man may be on his
    #>  7 first entering a neighbourhood, this truth is so well fixed in the minds
    #>  8 of the surrounding families, that he is considered the rightful property
    #>  9                                 of some one or other of their daughters.
    #> 10 "my dear mr. bennet," said his lady to him one day, "have you heard that
    #> # ... with 10,711 more rows, and 1 more variables: id <int>
    

    Now you can use unnest_tokens() again, but this time with words so that you will get a row for each word. Notice that you still know which line each word came from.

    d_words <- d_lines %>%
        unnest_tokens(word, line, token = "words")
    
    d_words
    #> # A tibble: 122,204 × 2
    #>       id      word
    #>    <int>     <chr>
    #>  1     1     pride
    #>  2     1       and
    #>  3     1 prejudice
    #>  4     2        by
    #>  5     2      jane
    #>  6     2    austen
    #>  7     3   chapter
    #>  8     3         1
    #>  9     4        it
    #> 10     4        is
    #> # ... with 122,194 more rows
    

    Now you can do any kind of counting you want. For example, to find out how many words each line has, count the rows for each id:

    d_words %>%
        count(id)
    
    #> # A tibble: 10,715 × 2
    #>       id     n
    #>    <int> <int>
    #>  1     1     3
    #>  2     2     3
    #>  3     3     2
    #>  4     4    12
    #>  5     5    11
    #>  6     6    15
    #>  7     7    13
    #>  8     8    11
    #>  9     9     8
    #> 10    10    15
    #> # ... with 10,705 more rows
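
The same two-step approach can be carried over to the review data from the question. The following is only a minimal sketch: the `reviews` tibble and its contents are made up for illustration (only the column names `asin` and `reviewText` come from the question), and `line_id` and `n_words` are names chosen here. Other columns such as `asin` are kept through both `unnest_tokens()` calls, so the words can be counted per line or per product.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the review data; only the column names
# `asin` and `reviewText` are taken from the question.
reviews <- tibble(
  asin = c("A1", "A1", "B2"),
  reviewText = c(
    "Great product.\nWould buy it again!",
    "Stopped working after a week.",
    "Does the job."
  )
)

# Step 1: one row per line, plus an id so every line stays identifiable
review_lines <- reviews %>%
  unnest_tokens(line, reviewText, token = "lines") %>%
  mutate(line_id = row_number())

# Step 2: one row per word; asin and line_id are carried along
review_words <- review_lines %>%
  unnest_tokens(word, line, token = "words")

# Number of words in each line, and in each product's reviews overall
words_per_line <- review_words %>%
  count(asin, line_id, name = "n_words")
words_per_product <- review_words %>%
  count(asin, name = "n_words")

# Same per-line count after removing stop words. The anti_join() is done
# on word tokens here, since stop_words holds single words, not lines.
words_per_line_nostop <- review_words %>%
  anti_join(stop_words, by = "word") %>%
  count(asin, line_id, name = "n_words")

words_per_line
words_per_product
```

Note that the stop-word removal happens after the second `unnest_tokens()` call. Joining `stop_words` against whole lines, as in the question, would match almost nothing, because a full line is rarely identical to a single stop word.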