Search code examples
rduplicates

Aggregating rows in a table using multiple aggregate operations based on column name in R


I have a table with web site pages and their visits. In some cases there are rows that are duplicate. I want to deduplicate rows based on yearMonth and page columns while summing users and sessions columns.

| yearMonth | page   | users | sessions | bounceRate |
| --------- | ------ | ----- | -------- | ---------- |
| 202405    | a.html | 20    | 22       | 0.48       |
| 202406    | b.html | 14    | 16       | 0.52       |
| 202405    | a.html | 2     | 3        | 0.90       |

For column bounceRate I want to either use weighted average of those rows (if possible) or use the value from row with the highest amount of sessions.

| yearMonth | page   | users | sessions |     bounceRate     |
| --------- | ------ | ----- | -------- | ------------------ |
| 202405    | a.html | 22    | 25       | 0.48 (value of max)|
| 202406    | b.html | 14    | 16       | 0.52               |

I tried the script below but it always sums the bounceRate column instead of using the value from row with the most sessions:

data <- data.frame(
  yearMonth = c(202405, 202406, 202405),
  page = c("a.html", "b.html", "a.html"),
  users = c(20, 14, 2),
  sessions = c(22, 16, 3),
  bounceRate = c(0.48, 0.52, 0.90)
)

result <- aggregate(. ~ yearMonth + page, data, function(x) {
  if (any(names(x) == "bounceRate")) {
      x[which.max(data$Sessions)]
  } else {
      sum(x)
  }
})

print(result)

Solution

  • For the bounceRate associated with the maximum session value you can use slice_max from dplyr:

    library(dplyr)
    
    summarise(data, users=sum(users), sessions=sum(sessions), .by=c(yearMonth, page)) |>
      inner_join(
            slice_max(data, sessions, by=c(yearMonth, page)) |>
      select(yearMonth, page, bounceRate), 
        by=c("yearMonth", "page"))
    

      yearMonth   page users sessions bounceRate
    1    202405 a.html    22       25       0.48
    2    202406 b.html    14       16       0.52
    

    For the mean bounceRate:

    summarise(data, 
       users=sum(users), 
       sessions=sum(sessions), 
       bounceRate=mean(bounceRate), .by=c(yearMonth, page))
    

      yearMonth   page users sessions bounceRate
    1    202405 a.html    22       25       0.69
    2    202406 b.html    14       16       0.52