This should be a basic question and there may well be duplicates, but I can't seem to find them, so please bear with me and point me to the right place. Thanks!
I have a data frame that contains integers with possible NAs and missing values. I'm computing row means (setting NAs to zero) and column means (skipping NAs). I'd like to then create a data frame (or table) containing the integers together with row means and column means. Here is an example data frame:
df <- data.frame(
'ID' = c("123A","456B","789C","1011","1213")
, 'Test 1' = c(55,65,60,NA,50)
, 'Test 2' = c(45,48,50,52,55)
, 'Test 3' = c(51,49,55,69,61)
)
df
    ID Test.1 Test.2 Test.3
1 123A     55     45     51
2 456B     65     48     49
3 789C     60     50     55
4 1011     NA     52     69
5 1213     50     55     61
Here is the function that computes column means skipping NAs:
colMean <- function(df, na.rm = TRUE) {
  # Column means (skipping NAs when na.rm = TRUE), appended as a final row
  means <- colMeans(df, na.rm = na.rm)
  return(rbind(df, "colMean" = means))
}
Here is the function that computes row means setting NAs to zero:
rowMeanz <- function(df) {
df[is.na(df)] <- 0
return(cbind(df, "rowMean" = rowMeans(df)))
}
One problem is that rbind alters the data type, in the sense that the integers are converted to floats (or appear to be) in the column labeled "Test.1":
colMean(df[sapply(df, is.numeric)])
        Test.1 Test.2 Test.3
1         55.0     45     51
2         65.0     48     49
3         60.0     50     55
4           NA     52     69
5         50.0     55     61
colMean   57.5     50     57
In your answer, I'd be very grateful for an explanation of why only the first column appears to be affected in this case. Is it related to the presence of the NA in the column?
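One way to check this directly is to look at the storage type of each column before and after `rbind`. The sketch below assumes the example data frame above (numeric columns only). In R, `c(55, 65, ...)` creates doubles rather than integers, and `print` shows every value in a column with the same number of decimals, so a column whose mean is not a whole number (57.5 here) is displayed with one decimal place throughout.

```r
# Numeric columns from the example above
df <- data.frame(
  Test.1 = c(55, 65, 60, NA, 50),
  Test.2 = c(45, 48, 50, 52, 55),
  Test.3 = c(51, 49, 55, 69, 61)
)

# Storage type of each column before adding the means
types_before <- sapply(df, typeof)

# Append the column means as an extra row, as colMean() does
col_means <- colMeans(df, na.rm = TRUE)
with_means <- rbind(df, colMean = col_means)

# Storage type of each column after rbind
types_after <- sapply(with_means, typeof)

print(types_before)
print(types_after)
print(col_means)
```

If the types are the same before and after, only the printed format has changed, and the NA matters only insofar as it changes the value of the mean.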
I have not observed the same problem with the other function, based on cbind:
rowMeanz(df[sapply(df, is.numeric)])
  Test.1 Test.2 Test.3  rowMean
1     55     45     51 50.33333
2     65     48     49 54.00000
3     60     50     55 55.00000
4      0     52     69 40.33333
5     50     55     61 55.33333
Eventually I'd like to obtain a dataframe or table that would look like this:
       ID Test.1 Test.2 Test.3  rowMean
1    123A     55     45     51 50.33333
2    456B     65     48     49 54.00000
3    789C     60     50     55 55.00000
4    1011     NA     52     69 40.33333
5    1213     50     55     61 55.33333
6 colMean   57.5     50     57
I'd appreciate it if you could show me how to do this in not too many steps. I'm open to base R answers, as well as answers based on packages. These calculations will run online inside a shiny app, so I'd particularly like to see efficient methods. Many thanks!
It is probably best to format each piece as character in the desired way and then put the pieces together.
df <- data.frame(
row.names = c("123A","456B","789C","1011","1213")
, 'Test 1' = c(55,65,60,NA,50)
, 'Test 2' = c(45,48,50,52,55)
, 'Test 3' = c(51,49,55,69,61)
)
colm <- colMeans(df, na.rm=TRUE)
d0 <- df
d0[is.na(d0)] <- 0
rowm <- rowMeans(d0)
dd <- format(df)
dc <- formatC(colm, digits=1, format="f")
dr <- formatC(rowm, digits=4, format="f")
out <- cbind(rbind(dd, colMeans=dc), rowMeans=c(dr, ""))
print(out, right=FALSE)
##          Test.1 Test.2 Test.3 rowMeans
## 123A     55     45     51     50.3333
## 456B     65     48     49     54.0000
## 789C     60     50     55     55.0000
## 1011     NA     52     69     40.3333
## 1213     50     55     61     55.3333
## colMeans 57.5   50.0   57.0
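If the ID should stay a regular column, as in the desired output in the question, the same steps can be wrapped in a small function. This is only a sketch: `add_means` and its `id_col` argument are hypothetical names introduced here, and it assumes that every non-ID column is numeric. The result holds character strings, so it is meant for display rather than further calculation.

```r
# Example data from the question (names become Test.1, Test.2, Test.3)
df <- data.frame(
  'ID' = c("123A", "456B", "789C", "1011", "1213"),
  'Test 1' = c(55, 65, 60, NA, 50),
  'Test 2' = c(45, 48, 50, 52, 55),
  'Test 3' = c(51, 49, 55, 69, 61)
)

# Hypothetical helper: formatted scores plus row and column means
add_means <- function(df, id_col = "ID") {
  num <- df[sapply(df, is.numeric)]        # numeric score columns only
  col_m <- colMeans(num, na.rm = TRUE)     # column means, skipping NAs
  num0 <- num
  num0[is.na(num0)] <- 0                   # row means count NA as zero
  row_m <- rowMeans(num0)

  # Character versions of the scores, with the column means as a last row
  body <- rbind(format(num), formatC(col_m, digits = 1, format = "f"))

  out <- data.frame(
    ID = c(as.character(df[[id_col]]), "colMean"),
    body,
    rowMean = c(formatC(row_m, digits = 4, format = "f"), ""),
    stringsAsFactors = FALSE,
    check.names = FALSE
  )
  rownames(out) <- NULL                    # plain 1..n row numbers
  out
}

res <- add_means(df)
print(res, right = FALSE)
```

In a shiny app, a function like this could be called inside `renderTable()`. All the work is done by vectorised base R calls, so it should stay cheap for tables of this size.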