Search code examples
rggplot2dplyrfrequencyanalysis

Find frequencies of multiple strings combined and plot?


I have a df that looks like

tags Year
North America , Economy - Goods , Economy - Spending - Federal 2008
Europe , Economy - Goods , Trading History & Goods 2009

They are not in order and there is a space after the last word before each comma.

"North America" and "Europe" would be considered master tags, and the rest would be sub-tags.

I am trying to find the frequencies of a master tag + a sub-tag so that it would like:

North America , Economy - Goods Europe , Economy - Goods Year
1 0 2008
0 1 2009

and then plot the frequencies of both columns in the same graph.

From a previous question, I used this code to find frequencies of single tags and plot them:

df <- data.frame(tags = c("North America , Economy - Goods , Economy - Spending - Federal", "Europe , Economy - Goods , Trading History & Goods"), Year = c(2008, 2009))

df1 <- df %>%
  separate_rows(tags, sep = ',\\s*') %>%
  separate(tags, c('tags', 'Value'), sep = '\\s*-\\s*',fill = 'right') %>%
  mutate(tags = trimws(tags)) %>% 
  count(Year, tags) %>%
  pivot_wider(names_from = tags, values_from = n, values_fill = 0)

# Subset for specific tags 
    df2 <- subset(df1, select = c("Year", "North America", 
    "Europe"))

# Reshape data frame for ggplot
    df3 <- data.frame(x = df2$Year,                            
                      y = c(df2$"North America", df2$"Europe"),
                      group = c(rep("North America", nrow(df2)),
                              rep("Europe", nrow(df2))))
# Plot
ggplot(df3, aes(x, y, col = group)) +            
   geom_line() 

But I am not sure how to alter this code to do multiple tags combined.

Thank you!

EDIT: Not sure if this helpful to anyone, but posting the solution I used that worked on my df:

df_1 <- df %>%
  filter(grepl("\\bNorth America\\b", tags)) %>%
  filter(grepl("Economy - Goods", tags)) %>%
  group_by(Year) %>% 
tally()


df_2 <- df %>%
  filter(grepl("Europe", tags)) %>%
  filter(grepl("Economy - Goods", tags)) %>%
  group_by(Year) %>% 
    tally()


# merge table
df3 <- full_join(df_1, df_2, by = "Year")
df3[is.na(df3)] <- 0

# plot
dfm <- melt(df3, id.vars = "Year")

p <- ggplot(dfm, aes(x = Year, y = value, colour = variable))
p + geom_line() +
  scale_x_continuous(limits = c(1999, 2020), breaks = c(seq(1999,    
2020, 1))) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Solution

  • Your example dataset might be too little to be very useful in our answers. Moreover, the plot that you proposed is quite odd, with Year as numeric. If my proposal is not what you are looking for, you might want to draw the plot you want by hand and post it.

    Still, here is some solution to what I understood was your problem:

    library(tidyverse)
      
    master_tags = c("North America", "Europe")
    df1 = df %>%
      mutate(master_tag = str_extract(tags, master_tags)) %>% 
      separate_rows(tags, sep = ',\\s*') %>% 
      mutate(tags = str_trim(tags)) %>% 
      filter(!tags %in% master_tags)
    
    ggplot(df1, aes(x=master_tag, fill=tags)) + 
      geom_histogram(stat="count", position="dodge") + 
      facet_wrap("Year")
    

    Created on 2021-06-02 by the reprex package (v2.0.0)

    Here, we can extract the master tag using stringr::str_extract() and remove it later when appropriate using dplyr::filter(). Then, we can directly count using geom_histogram().

    Obviously, this plot would be much more informative with a larger input dataset.