Tags: r, data.table, percentage, na, summary

Calculate metrics for multiple columns based on subsets defined by other columns


I would like to calculate simple summary metrics for subsets of certain columns in a data frame, where the subsets are based on information in other columns of the same data frame. Let me illustrate:

colA <- c(NA,2,3,NA,NA,3,9,5,6,1)
colB <- c(9,3,NA,2,2,4,6,1,9,9)
colC <- c(NA,NA,5,7,3,9,8,1,2,3)
colAA <- c(NA,NA,6,NA,NA,NA,1,7,9,4)
colBB <- c(NA,2,NA,7,8,NA,2,7,9,4)
colCC <- c(NA,NA,3,7,5,8,9,9,NA,3)

df <- data.frame(colA,colB,colC,colAA,colBB,colCC)

> df
   colA colB colC colAA colBB colCC
1    NA    9   NA    NA    NA    NA
2     2    3   NA    NA     2    NA
3     3   NA    5     6    NA     3
4    NA    2    7    NA     7     7
5    NA    2    3    NA     8     5
6     3    4    9    NA    NA     8
7     9    6    8     1     2     9
8     5    1    1     7     7     9
9     6    9    2     9     9    NA
10    1    9    3     4     4     3

Here colAA should be subset by colA, i.e. rows where colA is NA are removed:

> df1 <- subset(df, !is.na(colA))
> df1
   colA colB colC colAA colBB colCC
2     2    3   NA    NA     2    NA
3     3   NA    5     6    NA     3
6     3    4    9    NA    NA     8
7     9    6    8     1     2     9
8     5    1    1     7     7     9
9     6    9    2     9     9    NA
10    1    9    3     4     4     3

Now I would like to calculate e.g. column length and percentage of non-NA values within the column:

> length(df1$colAA)
[1] 7
> (nrow(subset(df1, !is.na(colAA)))/length(df1$colAA))*100
[1] 71.42857

In an ideal world, the output would be written to another data frame, e.g.:

    cat n perc_n
1 colAA 7     71
2 colBB 9     78
3 colCC 8     88

Is there a way to achieve this for all columns in a slightly more elegant/efficient manner? Any suggestions will be much appreciated!


Solution

  • You can pass the two sets of columns to Map:

    res = Map(function(x,y) summary(y[!is.na(x)]), df[,1:3], df[, 4:6])
    

    Since the post is tagged with data.table, I'd also recommend making a table like

    data.table::rbindlist(lapply(res, as.list), idcol = "col")
    
    #     col Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
    # 1: colA    1       4      6 5.400     7.0    9    2
    # 2: colB    2       3      7 5.571     7.5    9    2
    # 3: colC    3       4      7 6.286     8.5    9    1
    

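    Note that Map takes the element names from its first argument (the filtering columns), so the row labelled colA summarizes colAA, and so on. You can replace summary with whatever function you like that returns a named vector and it should still work with as.list + rbindlist.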
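
As a rough sketch of swapping in a different function, the version below returns the two metrics asked for in the question instead of summary. The helper name n_and_perc is made up for illustration, and rounding the percentage to a whole number is an assumption based on the desired output shown in the question. It also renames the list elements so the rows are labelled by the summarized columns rather than the filtering columns.

```r
# Data from the question
df <- data.frame(
  colA  = c(NA, 2, 3, NA, NA, 3, 9, 5, 6, 1),
  colB  = c(9, 3, NA, 2, 2, 4, 6, 1, 9, 9),
  colC  = c(NA, NA, 5, 7, 3, 9, 8, 1, 2, 3),
  colAA = c(NA, NA, 6, NA, NA, NA, 1, 7, 9, 4),
  colBB = c(NA, 2, NA, 7, 8, NA, 2, 7, 9, 4),
  colCC = c(NA, NA, 3, 7, 5, 8, 9, 9, NA, 3)
)

# Hypothetical helper: keep the values of y where x is not NA, then
# return the count of kept values and the (rounded) percentage of them
# that are themselves non-NA, as a named vector.
n_and_perc <- function(x, y) {
  kept <- y[!is.na(x)]
  c(n = length(kept), perc_n = round(100 * mean(!is.na(kept))))
}

# Pair colA/colAA, colB/colBB, colC/colCC
res <- Map(n_and_perc, df[, 1:3], df[, 4:6])

# Map names the result after its first argument; relabel with the
# names of the summarized columns instead.
names(res) <- names(df)[4:6]

# One row per column pair, with the list names stored in column "cat"
out <- data.table::rbindlist(lapply(res, as.list), idcol = "cat")
print(out)
```

With the question's data, n comes out as 7, 9, 8 and perc_n as 71, 78, 88 for colAA, colBB and colCC, which matches the table requested in the question.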