I'd like to create a plot from the Text Mining with R web textbook, but with my own data. It essentially finds the top terms per year and graphs them (Figure 5.4: http://tidytextmining.com/dtm.html). My data is a bit cleaner than the data they started with, but I'm new to R. My data has a "Date" column in 2016-01-01 format (it's a date class). I only have data from 2016, so I want to do the same thing but at a more granular level (i.e. by month or by day).
library(tidyr)
library(dplyr)
library(ggplot2)

# inaug_td is the tidied inaugural-address document-term data from earlier in that chapter
year_term_counts <- inaug_td %>%
  extract(document, "year", "(\\d+)", convert = TRUE) %>%
  complete(year, term, fill = list(count = 0)) %>%
  group_by(year) %>%
  mutate(year_total = sum(count))

year_term_counts %>%
  filter(term %in% c("god", "america", "foreign", "union", "constitution",
                     "freedom")) %>%
  ggplot(aes(year, count / year_total)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ term, scales = "free_y") +
  scale_y_continuous(labels = scales::percent_format()) +
  ylab("% frequency of word in inaugural address")
The idea is that I would choose specific words from my text and see how they change over the months.
Thank you!
If you want to look at smaller units of time based on a date column that you already have, I would recommend the floor_date() or round_date() functions from lubridate. The particular chapter of our book you linked to deals with taking a document-term matrix and then tidying it, etc. Have you already gotten to a tidy text format for your data? (If not, there is a rough sketch of that step after the code below.) If so, then you could do something like this:
library(dplyr)
library(lubridate)
library(ggplot2)

date_counts <- tidy_text %>%
  mutate(date = floor_date(Date, unit = "7 days")) %>% # use whatever time unit you want here
  count(date, word) %>%
  group_by(date) %>%
  mutate(date_total = sum(n))

date_counts %>%
  filter(word %in% c("PUT YOUR LIST OF WORDS HERE")) %>%
  ggplot(aes(date, n / date_total)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ word, scales = "free_y")
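In case your data isn't in a tidy, one-word-per-row format yet, here is a minimal sketch of that step using unnest_tokens() from tidytext. The data frame raw_text and its column names (Date, text) are just placeholders for illustration; swap in whatever your real data frame and text column are called:

library(dplyr)
library(tidytext)

# hypothetical input: one row per document, with your Date column and the raw text
raw_text <- tibble(
  Date = as.Date(c("2016-01-04", "2016-02-15")),
  text = c("some example text for the first document",
           "some example text for the second document")
)

tidy_text <- raw_text %>%
  unnest_tokens(word, text) %>%        # one row per word; Date is carried along on every row
  anti_join(stop_words, by = "word")   # optional: drop common English stop words

Once tidy_text has your Date column plus one word per row, the code above should work as-is; just pick whatever unit for floor_date() ("month", "week", "7 days", etc.) gives you the granularity you want. floor_date() always rounds down to the start of that unit, while round_date() goes to whichever boundary is nearest.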