I have a data frame that looks like this:
date | text |
---|---|
201901 | Thank you for helping me |
201902 | You are amazing |
201902 | For helping with this |
My aim is to calculate the word frequency of each line, so that the result eventually looks like this:
date | thank | you | for | helping | me | are | amazing | with | this |
---|---|---|---|---|---|---|---|---|---|
201901 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
201902 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
The actual data set is like this frame but contains millions of text lines, so I was wondering how to automate this process in R without typing out all those text lines.
Using R and tidyverse:
library(tidyverse)

df <- data.frame(date = c(201901, 201902, 201902),
                 text = c("Thank you for helping me", "You are amazing", "For helping with this"))
If you want your data as a table of counts:
df %>%
separate_rows(text, sep = " ") %>%
mutate(text = tolower(text)) %>%
table()
Output:
text
date amazing are for helping me thank this with you
201901 0 0 1 1 1 1 0 0 1
201902 1 1 1 1 0 0 1 1 1
If you want your output as a tibble:
df %>%
separate_rows(text, sep = " ") %>%
mutate(text = tolower(text)) %>%
table() %>%
as_tibble() %>%
pivot_wider(names_from = text, values_from = n)
Output:
# A tibble: 2 x 10
date amazing are `for` helping me thank this with you
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 201901 0 0 1 1 1 1 0 0 1
2 201902 1 1 1 1 0 0 1 1 1
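Since your real data has millions of lines, a sketch that skips the intermediate `table()` call and stays in dplyr/tidyr the whole way may scale better: `count()` tallies each (date, word) pair and `pivot_wider()` spreads the words into columns, with `values_fill = 0` filling in the zeros for words that never occur on a given date. This assumes the same simple whitespace tokenization as above.

```r
library(tidyverse)

df <- data.frame(date = c(201901, 201902, 201902),
                 text = c("Thank you for helping me", "You are amazing", "For helping with this"))

# One row per (date, word), lowercased, then counted and spread wide;
# values_fill = 0 puts a zero where a word never occurs on a date.
df %>%
  separate_rows(text, sep = " ") %>%
  mutate(text = tolower(text)) %>%
  count(date, text) %>%
  pivot_wider(names_from = text, values_from = n, values_fill = 0)
```

This also keeps `date` in its original type instead of coercing it to character, which `table()` does.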
Edit: transformed everything to lowercase to match your desired output, and added the output.
Edit 2: added the tibble variant so you can continue working with the result.