Search code examples
rmissing-datacategorical-datasummary

How to summarise a categorical variable with missing data?


I'm trying to perform a group_by summarise on a categorical variable, frailty score. The data is structured such that there are multiple observations for each subject, some of which contain missing data e.g.

Subject  Frailty
1        Managing well
1        NA
1        NA
2        NA
2        NA
2        Vulnerable
3        NA
3        NA
3        NA

I would like the data to be summarised so that a frailty description appears if there is one available, and NA if not e.g.

Subject  Frailty
1        Managing well
2        Vulnerable 
3        NA

I tried the following two approaches which both returned errors:

Mode <- function(x) {
ux <- na.omit(unique(x[!is.na(x)]))
tab <- tabulate(match(x, ux)); ux[tab == max(tab)]
}

data %>% 
group_by(Subject) %>% 
summarise(frailty = Mode(frailty)) %>% 

Error: Expecting a single value: [extent=2].
condense <- function(x){unique(x[!is.na(x)])}

data %>% 
group_by(subject) %>% 
summarise(frailty = condense(frailty))

Error: Column frailty must be length 1 (a summary value), not 0

Solution

  • If there are only one a single non-NA element, then after grouping by 'Subject', get the first non-NA element

    library(dplyr)
    data %>%
       group_by(Subject) %>%
       summarise(Frailty = Frailty[which(!is.na(Frailty))[1]])
    # A tibble: 3 x 2
    #  Subject Frailty      
    #    <int> <chr>        
    #1       1 Managing well
    #2       2 Vulnerable   
    #3       3 <NA>       
    

    If there are more than one non-NA unique elements, either we paste them together or return as a list

    data %>%
        group_by(Subject) %>%
        summarise(Frailty = na_if(toString(unique(na.omit(Frailty))), ""))
    # A tibble: 3 x 2
    #  Subject Frailty      
    #    <int> <chr>        
    #1       1 Managing well
    #2       2 Vulnerable   
    #3       3 <NA>      
    

    data

    data <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L
    ), Frailty = c("Managing well", NA, NA, NA, NA, "Vulnerable", 
    NA, NA, NA)), class = "data.frame", row.names = c(NA, -9L))