
How to find frequencies of multiple IDs from one column by year and plot them?


I have a df that looks like this:

ID                                              Year
Nation, Nation - NA, Economy, Economy - Asia    2008
Economy, Economy - EU, State, Nation            2009

I would like to extract the frequencies of the IDs so that the result looks like this:

Nation  Economy  State  Year
2       2        0      2008
1       2        1      2009

For IDs that contain a hyphen, such as "Economy - EU", I only want to count the part before the hyphen, so this one counts as an occurrence of "Economy".

My end goal is to plot this df by year, with the frequency counts of the different IDs in the same plot. So for example, "State" would be a green dot in 2008, "Nation" would be a red dot in 2008, and "Economy" would be a blue dot in 2008.

If the second df is not a good way to do this, I am also open to suggestions! That was just my first thought on how to start this.

I will post this as a separate question if it is not appropriate here, but my next question is how to plot the frequencies from the second df by year, as mentioned above.

Thank you!


Solution

  • You can split the ID column into one row per entry with separate_rows, splitting on the comma (,). Then use separate to move the part after the hyphen into its own column, so that "Economy - EU" is counted as "Economy". Finally, count the occurrences of each ID in each Year and reshape the result to wide format with pivot_wider.

    library(dplyr)
    library(tidyr)
    
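    df %>%
      # one row per comma-separated entry
      separate_rows(ID, sep = ',\\s*') %>%
      # keep the part before the hyphen in ID, the rest goes to Value
      separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*', fill = 'right') %>%
      # number of occurrences of each ID per Year
      count(Year, ID) %>%
      # one column per ID; combinations that never occur become 0
      pivot_wider(names_from = ID, values_from = n, values_fill = 0)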
    
    #   Year Economy Nation State
    #  <int>   <int>  <int> <int>
    #1  2008       2      2     0
    #2  2009       2      1     1
    

    You can also reduce the code by using janitor::tabyl.

    df %>%
      separate_rows(ID, sep = ',\\s*') %>%
      separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*', fill = 'right') %>%
      janitor::tabyl(Year, ID)
    

    data

    df <- structure(list(ID = c("Nation, Nation - NA, Economy, Economy - Asia", 
    "Economy, Economy - EU, State, Nation"), Year = 2008:2009), 
    class = "data.frame", row.names = c(NA, -2L))
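
For the plotting part of the question, the long format that comes out of count() is usually easier to hand to ggplot2 than the wide table. The following is only a minimal sketch under some assumptions: the names counts and p are made up for illustration, the colours are taken from the question, complete() is added so that zero counts such as "State" in 2008 still get a dot, and position_dodge() is one possible way to keep dots with the same count from hiding each other.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Same sample data as in the answer
df <- data.frame(
  ID = c("Nation, Nation - NA, Economy, Economy - Asia",
         "Economy, Economy - EU, State, Nation"),
  Year = 2008:2009
)

# Long format: one row per Year/ID combination
counts <- df %>%
  separate_rows(ID, sep = ',\\s*') %>%
  separate(ID, c('ID', 'Value'), sep = '\\s*-\\s*', fill = 'right') %>%
  count(Year, ID) %>%
  # add the Year/ID combinations that never occur, with a count of 0
  complete(Year, ID, fill = list(n = 0L))

# One dot per ID and year, coloured by ID
p <- ggplot(counts, aes(x = Year, y = n, colour = ID)) +
  # dodge slightly so equal counts in the same year stay visible
  geom_point(size = 3, position = position_dodge(width = 0.2)) +
  scale_x_continuous(breaks = unique(counts$Year)) +
  scale_colour_manual(values = c(Economy = "blue", Nation = "red", State = "green")) +
  labs(y = "Frequency")

print(p)
```

If the zero counts should not be drawn, the complete() step can simply be dropped.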