Search code examples
rdata-cleaning

Combining columns with a grouping in R


I need to categorize information from column names and restructure a dataset. Here is how my sample dataset looks like:

df <- data.frame(id = c(111,112,113),
                 Demo_1_Color_Naming = c("Text1","Text1","Text1"),
                 Demo_1.Errors =c(0,1,2),
                 Item_1_Color_Naming = c("Text1","Text1","Text1"),
                 Item_1.Time_in_Seconds =c(10,NA, 44),
                 Item_1.Errors = c(0,1,NA),
                 Demo_2_Shape_Naming = c("Text1","Text1","Text1"),
                 Demo_2.Errors =c(4,1,5),
                 Item_2_Shape_Naming = c("Text1","Text1","Text1"),
                 Item_2.Time_in_Seconds =c(55,35, 22),
                 Item_2.Errors = c(5,2,NA))

> df
   id Demo_1_Color_Naming Demo_1.Errors Item_1_Color_Naming Item_1.Time_in_Seconds Item_1.Errors Demo_2_Shape_Naming Demo_2.Errors
1 111               Text1             0               Text1                     10             0               Text1             4
2 112               Text1             1               Text1                     NA             1               Text1             1
3 113               Text1             2               Text1                     44            NA               Text1             5
  Item_2_Shape_Naming Item_2.Time_in_Seconds Item_2.Errors
1               Text1                     55             5
2               Text1                     35             2
3               Text1                     22            NA

The columns are grouped by the numbers 1,2,3,... Each number represesents a grouping name. For example number 1 in this dataset represents Color grouping where number 2 represents Shape grouping. I would like to keep Time_in_seconds info and Errors info. Then I need to sum both time and errors.

Additionally, this dataset is only limited to two grouping. The bigger dataset has more than 2 grouping. I need to handle this for a multi group/column.

How can I achieve this below:

> df1
   id ColorTime ShapeTime ColorError ShapeError TotalTime TotalError
1 111        10        55          0          5        65          5
2 112        NA        35          1          2        35          3
3 113        44        22         NA         NA        66         NA

Solution

  • We may do

    cbind(df['id'],  do.call(cbind, lapply(setNames(c("Time_in_Seconds", 
        "Item.*Errors"), c("Time_in_Seconds", "Errors")), \(x) {
      tmp <- df[grep(x, names(df), value = TRUE)]
      out <- setNames(as.data.frame(sapply(split.default(tmp, 
        gsub("\\D+", "", names(tmp))), rowSums, na.rm = TRUE)), c("Color", "Shape"))
      transform(out, Total = rowSums(out))
     })))
    

    -output

    id Time_in_Seconds.Color Time_in_Seconds.Shape Time_in_Seconds.Total Errors.Color Errors.Shape Errors.Total
    1 111                    10                    55                    65            0            5            5
    2 112                     0                    35                    35            1            2            3
    3 113                    44                    22                    66            0            0            0
    

    If we need the NAs

    cbind(df['id'], do.call(cbind, lapply(setNames(c("Time_in_Seconds", 
        "Item.*Errors"), c("Time_in_Seconds", "Errors")), \(x) {
      tmp <- df[grep(x, names(df), value = TRUE)]
      out <- setNames(as.data.frame(sapply(split.default(tmp, 
        gsub("\\D+", "", names(tmp))), \(u) Reduce(`+`, u))), c("Color", "Shape")); transform(out, Total = rowSums(out, na.rm = TRUE)) })))