Search code examples
rdataframesumaggregateweighted-average

How to aggregate two different columns with two different functions in R dataframe


I have a data frame that has some records that are duplicated and I need to aggregate the duplicates so there is a unique record per row.

An example:

Col1    Col2    Col3    Col4
A       0.170   83     0.878
B       0.939   103    0.869
C       0.228   80     0.935
D       0.566   169    0.851
D       0.566   137    0.588
E       0.703   103    0.636

I need to weight the average of Col4 with Col3, and sum Col3. So my result would be:

Col1    Col2    Col3    Col4
A      0.17     83     0.878
B      0.939    103    0.869
C      0.228    80     0.935
D      0.566    306    0.733
E      0.703    103    0.636

Usually I would use the aggregate function but I can't seem to find a solution to include two different function types. Is there another way I can accomplish this? I am effectively ignoring Col 2 since the granularity before merging with the data that brought in Col3 and Col4 was one record per row, and now it is being duplicated.

Thank you!!


Solution

  • Using dplyr, you can use group_by to keep all unique rows of "Col1" and then pass all your different function into summarise. With your example, it can be something like that.

    NB: To calculate weighted.mean of Col4 by Col3, you need to pass this function before calculating the sum of Col3, otherwise length of Col4 and Col3 will differ.

    You can then reorganize your dataframe in the correct order using select:

    library(dplyr)
    df %>% group_by(Col1) %>%
      summarise(Col2 = mean(Col2),
                Col4 = weighted.mean(Col4,Col3),
                Col3 = sum(Col3)) %>%
      select(Col1,Col2,Col3,Col4)
    
    # A tibble: 5 x 4
      Col1   Col2  Col3  Col4
      <chr> <dbl> <int> <dbl>
    1 A     0.17     83 0.878
    2 B     0.939   103 0.869
    3 C     0.228    80 0.935
    4 D     0.566   306 0.733
    5 E     0.703   103 0.636
    

    Data

    structure(list(Col1 = c("A", "B", "C", "D", "D", "E"), Col2 = c(0.17, 
    0.939, 0.228, 0.566, 0.566, 0.703), Col3 = c(83L, 103L, 80L, 
    169L, 137L, 103L), Col4 = c(0.878, 0.869, 0.935, 0.851, 0.588, 
    0.636)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
    ), .internal.selfref = <pointer: 0x561706072cc0>)