Search code examples
rdataframedplyrsum

Sum values in one dataframe based on values of column in another dataframe


I have two dataframes that look like the following

library(dplyr)

State = c('AK','AL','AR','AZ')
Perc = c(0.0023, 0.0034, 0.0043, 0.065)

df1 <- data.frame(State, Perc)

ID = c('A','B','C','D','E')
States = c('AK','AK; AL', 'AK; AL; AR; AZ', 'AR; AZ', 'AL')

df2 <- data.frame(ID, States)

I want to create a new column in df2 that locates the value of Perc in df1 for a particular state and sums up the values for each ID. Ideally, I would like to have a new df2 that looks like this:

ID    States           SumPerc
A     AK               0.0023
B     AK; AL           0.0057
C     AK; AL; AR; AZ   0.0750
D     AR; AZ           0.0693
E     AL               0.0034

How should I go about this process?


Solution

  • First we split df2 into a row for each state, then we join in the Perc data. Then we just need to aggregate back up based on the ID group, and we re-concatenate States and sum up Perc

    library(tidyverse)
    
    df2 %>% separate_rows(States, sep = "; ") %>% 
      left_join(., df1, by = c("States" = "State")) %>% 
      group_by(ID) %>% 
      summarize(
        Perc = sum(as.numeric(Perc), na.rm = TRUE), 
        States = paste(States, collapse= "; ")
      )
    
    
      ID      Perc States        
      <chr>  <dbl> <chr>         
    1 A     0.0023 AK            
    2 B     0.0057 AK; AL        
    3 C     0.075  AK; AL; AR; AZ
    4 D     0.0693 AR; AZ        
    5 E     0.0034 AL