I have a table of website pages and their visits. Some rows are duplicates. I want to deduplicate rows on the yearMonth and page columns while summing the users and sessions columns.
| yearMonth | page | users | sessions | bounceRate |
| --------- | ------ | ----- | -------- | ---------- |
| 202405 | a.html | 20 | 22 | 0.48 |
| 202406 | b.html | 14 | 16 | 0.52 |
| 202405 | a.html | 2 | 3 | 0.90 |
For the bounceRate column I want to use either a weighted average of the duplicate rows (if possible) or the value from the row with the higher number of sessions.
| yearMonth | page | users | sessions | bounceRate |
| --------- | ------ | ----- | -------- | ------------------ |
| 202405 | a.html | 22 | 25 | 0.48 (value of max)|
| 202406 | b.html | 14 | 16 | 0.52 |
I tried the script below, but it always sums the bounceRate column instead of using the value from the row with the most sessions:
```r
data <- data.frame(
  yearMonth = c(202405, 202406, 202405),
  page = c("a.html", "b.html", "a.html"),
  users = c(20, 14, 2),
  sessions = c(22, 16, 3),
  bounceRate = c(0.48, 0.52, 0.90)
)
result <- aggregate(. ~ yearMonth + page, data, function(x) {
  if (any(names(x) == "bounceRate")) {
    x[which.max(data$Sessions)]
  } else {
    sum(x)
  }
})
print(result)
```
Your `aggregate` call sums every column because the aggregation function receives each column as a plain, unnamed vector, so `names(x)` is always `NULL` and the `sum` branch runs every time (also note `data$Sessions` should be `data$sessions` — column names are case-sensitive). To get the bounceRate associated with the maximum sessions value you can use `slice_max` from dplyr:
```r
library(dplyr)

summarise(data, users = sum(users), sessions = sum(sessions), .by = c(yearMonth, page)) |>
  inner_join(
    slice_max(data, sessions, by = c(yearMonth, page)) |>
      select(yearMonth, page, bounceRate),
    by = c("yearMonth", "page"))
```

```
  yearMonth   page users sessions bounceRate
1    202405 a.html    22       25       0.48
2    202406 b.html    14       16       0.52
```
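If you would rather stay in base R (closer to your original `aggregate` attempt), a sketch of the same max-sessions approach: sum the numeric columns per group, pick each group's row with the most sessions via `order` plus `duplicated`, and `merge` the two results.

```r
data <- data.frame(
  yearMonth = c(202405, 202406, 202405),
  page = c("a.html", "b.html", "a.html"),
  users = c(20, 14, 2),
  sessions = c(22, 16, 3),
  bounceRate = c(0.48, 0.52, 0.90)
)

# sum users and sessions per (yearMonth, page)
sums <- aggregate(cbind(users, sessions) ~ yearMonth + page, data, sum)

# sort by descending sessions so the first row of each group is the max;
# duplicated() then keeps only that first row per group
ord <- data[order(-data$sessions), ]
top <- ord[!duplicated(ord[c("yearMonth", "page")]),
           c("yearMonth", "page", "bounceRate")]

result <- merge(sums, top, by = c("yearMonth", "page"))
print(result)
```

This gives the same two rows as the dplyr version, with bounceRate 0.48 for a.html and 0.52 for b.html.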
For the plain (unweighted) mean bounceRate:
```r
summarise(data,
  users = sum(users),
  sessions = sum(sessions),
  bounceRate = mean(bounceRate),
  .by = c(yearMonth, page))
```

```
  yearMonth   page users sessions bounceRate
1    202405 a.html    22       25       0.69
2    202406 b.html    14       16       0.52
```
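Since the question also asks about a session-weighted average: base R's `weighted.mean()` does exactly that. One caveat in `summarise()` is that later expressions see already-summarised columns, so compute `bounceRate` before overwriting `sessions` with its sum (or compute the weighted mean using the original column order, as below).

```r
library(dplyr)

data <- data.frame(
  yearMonth = c(202405, 202406, 202405),
  page = c("a.html", "b.html", "a.html"),
  users = c(20, 14, 2),
  sessions = c(22, 16, 3),
  bounceRate = c(0.48, 0.52, 0.90)
)

result <- summarise(data,
  # weight by the per-row sessions BEFORE sessions is replaced by its sum
  bounceRate = weighted.mean(bounceRate, sessions),
  users = sum(users),
  sessions = sum(sessions),
  .by = c(yearMonth, page))
print(result)
```

For a.html this gives (0.48 * 22 + 0.90 * 3) / 25 = 0.5304, which reflects the traffic of each duplicate row rather than treating both rows equally.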