Search code examples
rdplyr

How to group columns with an odd index, based on columns with an even index?


I am the beginner of R programming. I have to group the dataframe in R by the condition such as: it's all odd number of column names need to grouped based on it's corresponding values present in the even number of columns. I have attached the example images. enter image description here enter image description here

I have used the following R script which I got from chatGPT, however it is not giving accurate result. Though it groups the values correctly, it returns more number of rows and the order of columns also changing in the output result.

library(dplyr)
library(purrr)
library(tidyr)

group_odd_by_even <- function(df) {
  # Get the column names
  col_names <- colnames(df)
  
  # Identify odd and even indexed columns
  odd_cols <- col_names[seq(1, length(col_names), by = 2)]
  even_cols <- col_names[seq(2, length(col_names), by = 2)]
  
  # Initialize an empty list to store grouped data
  grouped_list <- map2(odd_cols, even_cols, ~ {
    # Group odd columns by their respective even columns
    df %>%
      select(!!sym(.x), !!sym(.y)) %>%
      group_by(!!sym(.y)) %>%
      summarise(!!sym(.x) := paste(!!sym(.x), collapse = ", ")) %>%
      rename(!!sym(.y) := !!sym(.y))
  })
  
  # Reduce all grouped dataframes into a single dataframe by joining on even columns
  grouped_df <- reduce(grouped_list, ~full_join(.x, .y, by = intersect(colnames(.x), colnames(.y))))
  
  return(grouped_df)
}

The example input data,

structure(list(col1 = c("A", "B", "C", "D"), col2 = c(1, 2, 3, 
4), col3 = c("A", "B", "C", "D"), col4 = c(2, 3, 2, 3), col5 = c("A", 
"B", "C", "D"), col6 = c(1, 2, 3, 1), col7 = c("A", "B", "C", 
"D"), col8 = c(1, 1, 1, 1), col9 = c("A", "B", "C", "D"), col10 = c(1, 
1, 1, 1)), class = "data.frame", row.names = c(NA, -4L))

Please help me to write the generalized function for that. Thank you in advance.


Solution

  • I'd be curious to see any approaches that are simpler and shorter. Here's an approach using reshaping.

    First I add row numbers for tracking, and reshape longer. From here, we can add variables to track which pair of columns we're in, and which type of column (value or group) we're in.

    Then we can reshape wide again so each row has a grouping column and a value column. We can use summarize to concatenate the values within each group in each column pair.

    And finally reshape wide again.

    library(tidyverse)
    
    df |>
      mutate(row = row_number()) |>
      pivot_longer(-row, values_transform = as.character) |>
      mutate(pair_num = (row_number() + 1) %/% 2, 
             type = if_else(row_number() %% 2 == 1, "val", "grp"), .by = row) |>
      select(-name) |>
    
      pivot_wider(names_from = type, values_from = value) |>
      summarize(vals = paste0(val, collapse = ", "),
                .by = c(pair_num, grp)) |>
      mutate(row = row_number(), .by = pair_num) |>
    
      pivot_wider(names_from = pair_num, values_from = c(vals, grp), names_vary = "slowest") |>
      select(-row) |>
      `colnames<-`(colnames(df)) # apply original column names
    

    Result (on provided input data, which varies from pictures in OP)

      col1  col2  col3  col4  col5  col6  col7       col8  col9       col10
      <chr> <chr> <chr> <chr> <chr> <chr> <chr>      <chr> <chr>      <chr>
    1 A     1     A, C  2     A, D  1     A, B, C, D 1     A, B, C, D 1    
    2 B     2     B, D  3     B     2     NA         NA    NA         NA   
    3 C     3     NA    NA    C     3     NA         NA    NA         NA   
    4 D     4     NA    NA    NA    NA    NA         NA    NA         NA