I have large dataframe which has some columns with only 'NA' values. I want to summarize each row in term of percentage.
let's say: df
user col1 col2 col3 col4 col5 col6
100 1 1 2 2 1 NA
200 1 2 3 3 NA NA
300 2 3 3 3 2 NA
I want to summarize the user
rows based on the percentage of total members.
for example, user:100 has 3/5 of event 1 and 2/5 event 2.
summarized_df:
user event1 event2 event3
100 3/5 2/5 0
200 1/4 1/4 2/4
300 0 2/5 3/5
it is also useful to use percentage for each event.
how can I do it in R?
Here is a base R method with apply
, table
, and prop.table
.
cbind(dat[1],
prop.table(t(apply(dat[-1], 1,
function(x) table(factor(x, levels=1:3)))), 1))
factor
is needed to assure that the output of table
applied to each row returns every potential element (1:3
), even if one or more levels is not observed. Here, apply
runs through all of the rows, and returns counts for each event, including 0s when the event did not occur. Because the output of each call has the same length, apply
returns a matrix. We transpose the matrix and the use prop.table
to calculate the the proportion of each event for each row. Finally, cbind
combines the first column with this matrix returning a data.frame with the desired output.
this returns
user 1 2 3
1 100 0.60 0.40 0.0
2 200 0.25 0.25 0.5
3 300 0.00 0.40 0.6
data
dat <-
structure(list(user = c(100L, 200L, 300L), col1 = c(1L, 1L, 2L
), col2 = 1:3, col3 = c(2L, 3L, 3L), col4 = c(2L, 3L, 3L), col5 = c(1L,
NA, 2L), col6 = c(NA, NA, NA)), .Names = c("user", "col1", "col2",
"col3", "col4", "col5", "col6"), class = "data.frame", row.names = c(NA,-3L))