Search code examples
rdataframeconsolidation

How can I combine sparse columns in a data fame?


I have an R script that creates a data frame with 61 columns. The data.frame is made by reading a bunch of csv files into a list of data.frames, then merging the list such that commonly named columns in each data.frame in the list populate the same column in the resulting data.frame.

Some of the columns that should be combined have been inconsistently named in the csv files (eg date.received vs received.on.date vs date.sample.received), and I was wondering what the best approach to combining them would be.

I had a couple ideas:

  • rename the columns before merging in a big lapply over the list.
  • combine the columns that should be the same once I have my data.frame, such that the column which has a value in that row is used

is the second approach possible (and how?) or is there a better way?


Solution

  • The second approach is possible and it goes easy with rbind_all from dplyr package. Here is how:

    First of all, if you have some information about the pattern of the names of columns that should be stacked together, I suggest you to try to fix it before stacking, like:

    colnames_synonymous <- c("date.received", "received.on.date", "date.sample.received")
    
    list_of_dfs <- lapply(list_of_dfs, function(df) {
      names(df)[names(df) %in% colnames_synonymous] <- "date_received"
      return(df)
    })
    

    Now you are good to go:

    dplyr::rbind_all(list_of_dfs)
    

    Maybe you will have to do some adjustments before getting all columns stacked right, but now, all you need to do is changing the lapply function to do so. I find this way easier than make some columns transformations after rbinding.