Binding data frames with different data types


This should be a basic question and there may well be duplicates, but I can't seem to find them, so please bear with me and point me to the right place. Thanks!

I have a data frame that contains integers, some of which may be missing (NA). I'm computing row means (setting NAs to zero) and column means (skipping NAs). I'd then like to create a data frame (or table) containing the integers together with the row means and column means. Here is an example data frame:

df <- data.frame(
  'ID' = c("123A","456B","789C","1011","1213")
  , 'Test 1' = c(55,65,60,NA,50)
  , 'Test 2' = c(45,48,50,52,55)
  , 'Test 3' = c(51,49,55,69,61)
 )
df
    ID Test.1 Test.2 Test.3
1 123A     55     45     51
2 456B     65     48     49
3 789C     60     50     55
4 1011     NA     52     69
5 1213     50     55     61

Here is the function that computes column means skipping NAs:

colMean <- function(df, na.rm = TRUE) {
  # Column means; NAs are skipped when na.rm = TRUE
  colMean <- colMeans(df, na.rm = na.rm)
  # Append the means as a final row named "colMean"
  return(rbind(df, "colMean" = colMean))
}

Here is the function that computes row means setting NAs to zero:

rowMeanz <- function(df) {
  df[is.na(df)] <- 0
  return(cbind(df, "rowMean" = rowMeans(df)))
}

One problem is that rbind appears to alter the data type: the integers in the column labeled "Test.1" are displayed as floats:

colMean(df[sapply(df, is.numeric)])
        Test.1 Test.2 Test.3
1         55.0     45     51
2         65.0     48     49
3         60.0     50     55
4           NA     52     69
5         50.0     55     61
colMean   57.5     50     57

In your answer, I'd be very grateful for an explanation of why only the first column appears to be affected in this case. Is it related to the presence of the NA in the column?

I have not observed the same problem with the other function, based on cbind:

rowMeanz(df[sapply(df, is.numeric)])
  Test.1 Test.2 Test.3  rowMean
1     55     45     51 50.33333
2     65     48     49 54.00000
3     60     50     55 55.00000
4      0     52     69 40.33333
5     50     55     61 55.33333

Eventually I'd like to obtain a dataframe or table that would look like this:

       ID Test.1 Test.2 Test.3  rowMean
1    123A     55     45     51 50.33333
2    456B     65     48     49 54.00000
3    789C     60     50     55 55.00000
4    1011     NA     52     69 40.33333
5    1213     50     55     61 55.33333
6 colMean   57.5     50     57

I'd appreciate it if you could show me how to do this in not too many steps. I'm open to base R answers as well as answers based on packages. These calculations will run online inside a shiny app, so I'd particularly like to see efficient methods. Many thanks!


Solution

  • It is probably best to format each piece as character strings with the desired number of decimals, and then bind the pieces together.

    df <- data.frame(
      row.names = c("123A","456B","789C","1011","1213")
      , 'Test 1' = c(55,65,60,NA,50)
      , 'Test 2' = c(45,48,50,52,55)
      , 'Test 3' = c(51,49,55,69,61)
     )
    
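    # Column means skip NAs
    colm <- colMeans(df, na.rm=TRUE)
    # Row means treat NAs as zero; work on a copy so df keeps its NA
    d0 <- df
    d0[is.na(d0)] <- 0
    rowm <- rowMeans(d0)
    
    # Format each piece as character with the desired number of decimals
    dd <- format(df)
    dc <- formatC(colm, digits=1, format="f")
    dr <- formatC(rowm, digits=4, format="f")
    # Append the column means as a row, then the row means as a column
    # (the "" fills the bottom-right corner cell)
    out <- cbind(rbind(dd, colMeans=dc), rowMeans=c(dr, ""))
    print(out, right=FALSE)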
    
    ##          Test.1 Test.2 Test.3 rowMeans
    ## 123A     55     45     51     50.3333 
    ## 456B     65     48     49     54.0000 
    ## 789C     60     50     55     55.0000 
    ## 1011     NA     52     69     40.3333 
    ## 1213     50     55     61     55.3333 
    ## colMeans 57.5   50.0   57.0
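
As for why only Test.1 appears to change after rbind: as far as I can tell this is a display effect, not a type conversion. The sketch below rebuilds the numeric columns from the question and checks the storage type before and after binding. The names types_before, types_after, bound and shown are only illustrative. It assumes the data are created with c(55, 65, ...), as in the question, which already gives doubles and not integers.

```r
# Same numeric columns as in the question; c(55, 65, ...) creates doubles
df <- data.frame(
  Test.1 = c(55, 65, 60, NA, 50),
  Test.2 = c(45, 48, 50, 52, 55),
  Test.3 = c(51, 49, 55, 69, 61)
)

# Storage type of every column before any binding
types_before <- sapply(df, typeof)

# Column means, skipping NAs: only Test.1 has a fractional mean
colm <- colMeans(df, na.rm = TRUE)

# Append the means as a row and check the storage types again
bound <- rbind(df, colMean = colm)
types_after <- sapply(bound, typeof)

# format() (and hence print()) uses one common number of decimals per column
shown <- format(bound)

print(types_before)
print(types_after)
print(shown)
```

Since the Test.1 mean is 57.5, the whole column is shown with one decimal, while the means of Test.2 and Test.3 are whole numbers (50 and 57), so those columns are shown without decimals. The NA changes the value of the mean but is not itself what triggers the decimals.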