I have a df that looks like
tags | Year |
---|---|
North America , Economy - Goods , Economy - Spending - Federal | 2008 |
Europe , Economy - Goods , Trading History & Goods | 2009 |
They are not in order and there is a space after the last word before each comma.
"North America" and "Europe" would be considered master tags, and the rest would be sub-tags.
I am trying to find the frequencies of a master tag + a sub-tag so that it would like:
North America , Economy - Goods | Europe , Economy - Goods | Year |
---|---|---|
1 | 0 | 2008 |
0 | 1 | 2009 |
and then plot the frequencies of both columns in the same graph.
From a previous question, I used this code to find frequencies of single tags and plot them:
df <- data.frame(tags = c("North America , Economy - Goods , Economy - Spending - Federal", "Europe , Economy - Goods , Trading History & Goods"), Year = c(2008, 2009))
df1 <- df %>%
separate_rows(tags, sep = ',\\s*') %>%
separate(tags, c('tags', 'Value'), sep = '\\s*-\\s*',fill = 'right') %>%
mutate(tags = trimws(tags)) %>%
count(Year, tags) %>%
pivot_wider(names_from = tags, values_from = n, values_fill = 0)
# Subset for specific tags
df2 <- subset(df1, select = c("Year", "North America",
"Europe"))
# Reshape data frame for ggplot
df3 <- data.frame(x = df2$Year,
y = c(df2$"North America", df2$"Europe"),
group = c(rep("North America", nrow(df2)),
rep("Europe", nrow(df2))))
# Plot
ggplot(df3, aes(x, y, col = group)) +
geom_line()
But I am not sure how to alter this code to do multiple tags combined.
Thank you!
EDIT: Not sure if this helpful to anyone, but posting the solution I used that worked on my df:
df_1 <- df %>%
filter(grepl("\\bNorth America\\b", tags)) %>%
filter(grepl("Economy - Goods", tags)) %>%
group_by(Year) %>%
tally()
df_2 <- df %>%
filter(grepl("Europe", tags)) %>%
filter(grepl("Economy - Goods", tags)) %>%
group_by(Year) %>%
tally()
# merge table
df3 <- full_join(df_1, df_2, by = "Year")
df3[is.na(df3)] <- 0
# plot
dfm <- melt(df3, id.vars = "Year")
p <- ggplot(dfm, aes(x = Year, y = value, colour = variable))
p + geom_line() +
scale_x_continuous(limits = c(1999, 2020), breaks = c(seq(1999,
2020, 1))) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Your example dataset might be too little to be very useful in our answers. Moreover, the plot that you proposed is quite odd, with Year as numeric. If my proposal is not what you are looking for, you might want to draw the plot you want by hand and post it.
Still, here is some solution to what I understood was your problem:
library(tidyverse)
master_tags = c("North America", "Europe")
df1 = df %>%
mutate(master_tag = str_extract(tags, master_tags)) %>%
separate_rows(tags, sep = ',\\s*') %>%
mutate(tags = str_trim(tags)) %>%
filter(!tags %in% master_tags)
ggplot(df1, aes(x=master_tag, fill=tags)) +
geom_histogram(stat="count", position="dodge") +
facet_wrap("Year")
Created on 2021-06-02 by the reprex package (v2.0.0)
Here, we can extract the master tag using stringr::str_extract()
and remove it later when appropriate using dplyr::filter()
. Then, we can directly count using geom_histogram()
.
Obviously, this plot would be much more informative with a larger input dataset.