Search code examples
rdataframemean

Create a new column by calcuating the mean of existing columns - according to another dataframe


I have two big dataframes. I will use dummy data to explain what I need.

Say I have this gene expression data, lets call it df1:

structure(list(Sample1 = c(102, 34, 77, 13, 10), Sample2 = c(1000, 
23, 199, 0, 1), Sample3 = c(293, 234, 891, 230, 2002), Sample4 = c(672, 
756, 431, 293, 0), Sample5 = c(28, 2910, 0, 120, 129)), class = "data.frame", row.names = c("GeneA", 
"GeneB", "GeneC", "GeneD", "GeneE"))

And I have the metadata, which is additional information about the samples in df1. Lets call it df2:

structure(list(Age = c(54, 70, 49, 23, 90), Patient = c(10, 10, 
42, 108, 20)), class = "data.frame", row.names = c("Sample1", 
"Sample2", "Sample3", "Sample4", "Sample5"))

The colnames in df1 are the same as the rownames as df2. If you look at df2, you'll see the column Patient. I want for the samples that have the same value in the Patient column, to calculate their mean and make them one united sample in df1.

So here, both Sample1 and Sample2 have the value 10 in the Patient column (in df2). Therefore, those two columns in df1, I want to make them one column, which is their average. Please note that This is just a dummy, so there might be more than only two samples that share the same value in this column. Sample5 or Sample6 and Sample7 might also share the same value.

What do I mean their average? That is to calculate the sum of each row individually and devide by two.

So here, the first row (for GeneA) for this new united sample would be: (102 + 1000)/2 = 551

The next is (34 + 23)/2 = 28.5


Solution

  • We may split the row names of 'df2' by the 'Patient' column into a list. Then, if the length of the list elements are greater than 1, get the average by looping over the list and applying rowMeans and then create new columns in 'df1'

    lst1 <- split(row.names(df2), df2$Patient)
    names(lst1) <- paste0(sapply(lst1, paste, collapse = "_"), "_average")
    i1 <- lengths(lst1) > 1
    df1[names(lst1)[i1]] <-lapply(lst1[i1], 
         function(nm) rowMeans(df1[nm], na.rm = TRUE))
    

    -output

    > df1
          Sample1 Sample2 Sample3 Sample4 Sample5 Sample1_Sample2_average
    GeneA     102    1000     293     672      28                   551.0
    GeneB      34      23     234     756    2910                    28.5
    GeneC      77     199     891     431       0                   138.0
    GeneD      13       0     230     293     120                     6.5
    GeneE      10       1    2002       0     129                     5.5