R script, how to group_by and max on factor values?


For reporting purposes, I have a data frame that is defined like this:

Data:
df_ischemia 12 obs. of 2 variables
  record_id : 'labelled' chr "1001" "1001" "1001" "1001" "1002" ...
  ..- attr(*, "label")= chr "Patient number"
  ischemic: Factor w/ 2 levels "Unchecked","Checked": NA NA 1 1 NA 2 NA 1 NA 2 ...
  ..- attr(*, "redcapLabels")= chr [1:2] "Unchecked" "Checked"
  ..- attr(*, "redcapLevels")= int [1:2] 0 1
  ..- attr(*, "label")= chr "Complication(s): Ischemia"

The real data frame has a couple of hundred rows, but for this example let's say it's got just 12 rows like this:

   | record_id  | ischemic
 1 | 1001       | NA
 2 | 1001       | NA
 3 | 1001       | Unchecked
 4 | 1001       | Unchecked
 5 | 1002       | NA
 6 | 1002       | Checked
 7 | 1003       | NA
 8 | 1003       | Unchecked
 9 | 1004       | NA
10 | 1004       | Checked
11 | 1004       | Checked
12 | 1004       | Checked

The goal is to group it by patient and keep only the patients that have a 'Checked' value, so the expected output should look like this:

  | record_id  | ischemic
1 | 1002       | Checked
2 | 1004       | Checked

I figured I could just use group_by and max:

df_ischemia <- group_by(record_id) %>% max(df_ischemia$ischemic)
# Error object 'record_id' not found

df_ischemia <- group_by(df_ischemia$record_id) %>% max(ischemic)
# no applicable method for 'group_by_' applied to an object of class "c('labelled', 'character')"

df_ischemia <- group_by(record_id) %>% summarise(df_ischemia$ischemic=max(df_ischemia$ischemic))
# Error: unexpected '=' ..

But none of that works. The factor does have int values, so a max should be possible(?). I read somewhere that the factor should be ordered. It looks like it's ordered, but I have no clue how to check whether that is the case, or how to set the order of an existing factor.


Solution

  • The first attempt needs summarise, with the data at the start of the chain:

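    library(dplyr)

    # start the chain with the data, then group and summarise per patient
    df_comp_lrcsp %>%
      group_by(record_id) %>%
      # which.max() on the integer codes picks the row with the highest level
      summarise(Max = comp_lrcsp___1[which.max(as.integer(comp_lrcsp___1))])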
    

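    In the first attempt the data is missing right after the <-: group_by is applied to the column 'record_id' without specifying the data 'df_comp_lrcsp', which is why 'record_id' is not found. Even with the data in place, max would be computed on the full column 'comp_lrcsp___1' instead of per group. Also, inside a chain the piped object becomes the first argument of the next function, so calling max directly on an extracted column may not work there either.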

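    The second attempt has the same problem: a column is passed where the data is expected, and max is applied without summarise. In the last attempt the data is again missing and the full column is extracted with $. A name of the form data$column is also not valid on the left of = inside summarise, hence the unexpected '=' error. In general, $ extracts the full column and so breaks the grouping.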
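
As a rough sketch of how this could look on the example from the question: the code below rebuilds the 12 rows by hand and assumes that 'df_comp_lrcsp' and 'comp_lrcsp___1' above correspond to 'df_ischemia' and 'ischemic' in the question. The REDCap 'labelled' attributes are left out, and the names result and result_ordered are only illustrative. It also shows one way to check whether a factor is ordered (is.ordered) and to make it ordered so that max works directly.

```r
library(dplyr)

# Rebuild the 12-row example from the question (assumption: plain
# character ids, no REDCap label attributes).
df_ischemia <- data.frame(
  record_id = c("1001", "1001", "1001", "1001", "1002", "1002",
                "1003", "1003", "1004", "1004", "1004", "1004"),
  ischemic = factor(
    c(NA, NA, "Unchecked", "Unchecked", NA, "Checked",
      NA, "Unchecked", NA, "Checked", "Checked", "Checked"),
    levels = c("Unchecked", "Checked")
  ),
  stringsAsFactors = FALSE
)

# A factor created with factor() is unordered by default.
is.ordered(df_ischemia$ischemic)  # FALSE

# Variant 1: keep only patients with at least one 'Checked' row,
# then take the highest level per patient via the integer codes.
result <- df_ischemia %>%
  group_by(record_id) %>%
  filter(any(ischemic == "Checked", na.rm = TRUE)) %>%
  summarise(ischemic = ischemic[which.max(as.integer(ischemic))])

# Variant 2: turn the factor into an ordered one so max() is defined,
# summarise per patient, then keep the 'Checked' patients.
result_ordered <- df_ischemia %>%
  mutate(ischemic = factor(ischemic,
                           levels = c("Unchecked", "Checked"),
                           ordered = TRUE)) %>%
  group_by(record_id) %>%
  summarise(ischemic = max(ischemic, na.rm = TRUE)) %>%
  filter(ischemic == "Checked")
```

Both variants should leave one row each for patients 1002 and 1004. Variant 2 relies on every patient having at least one non-NA value, which holds for this example but may not hold for the real data.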