I have a data frame with some duplicated records, and I need to aggregate the duplicates so that there is one unique record per row.
An example:
Col1 Col2 Col3 Col4
A 0.170 83 0.878
B 0.939 103 0.869
C 0.228 80 0.935
D 0.566 169 0.851
D 0.566 137 0.588
E 0.703 103 0.636
I need to take a weighted average of Col4, weighted by Col3, and sum Col3. So my result would be:
Col1 Col2 Col3 Col4
A 0.17 83 0.878
B 0.939 103 0.869
C 0.228 80 0.935
D 0.566 306 0.733
E 0.703 103 0.636
Usually I would use the aggregate function, but I can't seem to find a way to apply two different functions to different columns at once. Is there another way I can accomplish this? I am effectively ignoring Col2, since before merging with the data that brought in Col3 and Col4 the granularity was one record per row, and now those records are duplicated.
Thank you!!
Using dplyr, you can use group_by to keep one row per unique value of Col1, and then pass your different functions into summarise. With your example, it can look like the code below.
NB: to calculate the weighted.mean of Col4 by Col3, you need to compute it before calculating the sum of Col3; otherwise, Col3 will already have been collapsed to length 1 and the lengths of Col4 and Col3 will differ within summarise.
You can then put the columns back in the original order using select:
library(dplyr)

df %>%
  group_by(Col1) %>%
  summarise(Col2 = mean(Col2),
            Col4 = weighted.mean(Col4, Col3),
            Col3 = sum(Col3)) %>%
  select(Col1, Col2, Col3, Col4)
# A tibble: 5 x 4
Col1 Col2 Col3 Col4
<chr> <dbl> <int> <dbl>
1 A 0.17 83 0.878
2 B 0.939 103 0.869
3 C 0.228 80 0.935
4 D 0.566 306 0.733
5 E 0.703 103 0.636
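If you'd rather stay in base R, the same result can be reproduced by splitting on Col1 and summarising each group — a minimal sketch (the data frame is rebuilt inline so the snippet is self-contained):

```r
# Rebuild the example data.
df <- data.frame(
  Col1 = c("A", "B", "C", "D", "D", "E"),
  Col2 = c(0.17, 0.939, 0.228, 0.566, 0.566, 0.703),
  Col3 = c(83L, 103L, 80L, 169L, 137L, 103L),
  Col4 = c(0.878, 0.869, 0.935, 0.851, 0.588, 0.636)
)

# Split on Col1, summarise each group with a different function per
# column, then bind the one-row pieces back together.
res <- do.call(rbind, lapply(split(df, df$Col1), function(g) {
  data.frame(Col1 = g$Col1[1],
             Col2 = mean(g$Col2),
             Col3 = sum(g$Col3),
             Col4 = weighted.mean(g$Col4, g$Col3))
}))
res$Col4 <- round(res$Col4, 3)
res
```

This gives the same five rows as the dplyr pipeline, with D collapsed to Col3 = 306 and Col4 = 0.733.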
Data
structure(list(Col1 = c("A", "B", "C", "D", "D", "E"), Col2 = c(0.17,
0.939, 0.228, 0.566, 0.566, 0.703), Col3 = c(83L, 103L, 80L,
169L, 137L, 103L), Col4 = c(0.878, 0.869, 0.935, 0.851, 0.588,
0.636)), row.names = c(NA, -6L), class = "data.frame")
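As a quick sanity check on the collapsed D row, the weighted average can be verified by hand from the two original D records:

```r
# The two D rows from the data above.
col3 <- c(169L, 137L)   # weights
col4 <- c(0.851, 0.588) # values

# Weighted mean by hand: sum(weights * values) / sum(weights).
wm <- sum(col3 * col4) / sum(col3)

# Matches stats::weighted.mean and the 0.733 shown in the result.
round(wm, 3)
```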