Search code examples
rrowsummary

summarizing the proportion of records in each row of dataframe


I have large dataframe which has some columns with only 'NA' values. I want to summarize each row in term of percentage.
let's say: df

user col1 col2 col3 col4 col5 col6
 100   1    1    2   2    1    NA
 200   1    2    3   3    NA   NA
 300   2    3    3   3    2    NA

I want to summarize the user rows based on the percentage of total members.
for example, user:100 has 3/5 of event 1 and 2/5 event 2.

summarized_df:

user event1 event2 event3
100    3/5   2/5    0
200    1/4   1/4    2/4
300    0     2/5    3/5

it is also useful to use percentage for each event.
how can I do it in R?


Solution

  • Here is a base R method with apply, table, and prop.table.

    cbind(dat[1],
          prop.table(t(apply(dat[-1], 1,
                       function(x) table(factor(x, levels=1:3)))), 1))
    

    factor is needed to assure that the output of table applied to each row returns every potential element (1:3), even if one or more levels is not observed. Here, apply runs through all of the rows, and returns counts for each event, including 0s when the event did not occur. Because the output of each call has the same length, apply returns a matrix. We transpose the matrix and the use prop.table to calculate the the proportion of each event for each row. Finally, cbind combines the first column with this matrix returning a data.frame with the desired output.

    this returns

      user    1    2   3
    1  100 0.60 0.40 0.0
    2  200 0.25 0.25 0.5
    3  300 0.00 0.40 0.6
    

    data

    dat <- 
    structure(list(user = c(100L, 200L, 300L), col1 = c(1L, 1L, 2L
    ), col2 = 1:3, col3 = c(2L, 3L, 3L), col4 = c(2L, 3L, 3L), col5 = c(1L, 
    NA, 2L), col6 = c(NA, NA, NA)), .Names = c("user", "col1", "col2", 
    "col3", "col4", "col5", "col6"), class = "data.frame", row.names = c(NA,-3L))