Search code examples
rdataframegroup-bysum

How would I group by columns with equivalent prefix and sum them?


Let's say I have this dataframe:

> df <- data.frame(a1=c(1, 2, 3, 4, 5), b1=c(5, 4, 3, 2, 1), b2=c(10, 11, 12, 13, 14), a2=c(5, 6, 7, 8, 9), c1=c(100, 200, 300, 400, 500))
> df
  a1 b1 b2 a2  c1
1  1  5 10  5 100
2  2  4 11  6 200
3  3  3 12  7 300
4  4  2 13  8 400
5  5  1 14  9 500
> 

As you can see, I have columns a1 a2 that both start with a, I want to sum them row-wise and generate just one column of a, same for column b. Also for c column, just the c column should stay the same as it is since there are only one column that starts with c.

Desired output:

    a   b    c
1   6  15  100
2   8  15  200
3  10  15  300
4  12  15  400
5  14  15  500

Coming from a Python, in (Python Library), I could just simply use:

df.groupby(df.columns.str[:1], axis=1).sum()

But I am not sure how to do this in R.

I tried dplyr:

df %>% group_by(sub(".$", "", colnames(df))) %>%
  mutate(across(colnames(.), sum))

But an error popped up. New to R.


Solution

  • as.data.frame(lapply(unique(sapply(names(df), function(x) grep(substr(x,1,1), names(df)))), function(y) if(is.na(y[2])) df[[y[1]]] else df[[y[1]]] + df[[y[2]]]))
      c.6..8..10..12..14. c.15..15..15..15..15. c.100..200..300..400..500.
    1                   6                    15                        100
    2                   8                    15                        200
    3                  10                    15                        300
    4                  12                    15                        400
    5                  14                    15                        500
    

    You can then set names using:

    unique(substr(names(df),1,1))
    [1] "a" "b" "c"
    

    As commented:

    sapply(unique(sub(".$", "", colnames(df))), function(x) rowSums(df[startsWith(colnames(df), x)]))