Search code examples
rdplyrr-factor

Create column with minimum (first) level of factors by group r


I have a data frame with a mixture of patient and appointment information. Each patient may have attended multiple appointments. Some patient information is collected at each appointment, with the result that some patient information is conflicting and some is missing.

I want to fill in the missing patient information for appointments where it wasn't recorded based upon data recorded at other appointments. And (this is where I'm stuck) I want to take the 'minimum' level (earliest in order of levels) recorded for factors that have conflicting information for the same patient. (In the example below the factor levels are in alphabetical order , but this isn't always the case).

This is similar to this question, but I am using factors instead of characters, and I have multiple factors that I want the minimum for so can't filter by row.

eg. I have

df.have <- data.frame(
  grp_id = rep(1:3, each = 2),
  grpvar1 = factor(c("B", "A", "B", "C", NA, "A")),
  grpvar2 = factor(c("a", "b", "c", NA, NA, "x")),
  appt_id = 1:6)

I want

grp_id grpvar1   grpvar2 appt_id
     1       A         a       1
     1       A         a       2
     2       B         c       3
     2       B         c       4
     3       A         x       5
     3       A         x       6

or at least

grp_id grpvar1 grpvar1.1
     1       A         a
     2       B         c
     3       A         x

Solution

  • We can try with summarise_each. As we need the first level of 'grpvar' variables, we need to make sure that the unused levels are dropped before we do that (using droplevels).

    df.have %>% 
         group_by(grp_id) %>% 
         summarise_each(funs(first(levels(droplevels(.)))), grpvar1:grpvar2)
    #   grp_id grpvar1 grpvar2
    #   <int>   <chr>   <chr>
    #1      1       A       a
    #2      2       B       c
    #3      3       A       x
    

    Or if we use mutate_each, we get the first output

    df.have %>% 
        group_by(grp_id) %>%
        mutate_each(funs(levels(droplevels(.))[1]), grpvar1:grpvar2)    
    #  grp_id grpvar1 grpvar2 appt_id
    #   <int>   <chr>   <chr>   <int>
    #1      1       A       a       1
    #2      1       A       a       2
    #3      2       B       c       3
    #4      2       B       c       4
    #5      3       A       x       5
    #6      3       A       x       6
    

    if we need the output as factor columns for 'grpvar's.

     df.have %>% 
        group_by(grp_id) %>%
        mutate_each(funs(factor(levels(droplevels(.))[1])), grpvar1:grpvar2)    
     #   grp_id grpvar1 grpvar2 appt_id
     #    <int>  <fctr>  <fctr>   <int>
     #1      1       A       a       1
     #2      1       A       a       2
     #3      2       B       c       3
     #4      2       B       c       4
     #5      3       A       x       5
     #6      3       A       x       6
    

    Or using data.table

    library(data.table)
    setDT(df.have)[, lapply(.SD, function(x) levels(droplevels(x))[1]) , 
                    grp_id, .SDcols = grpvar1:grpvar2]