Tags: r, dplyr, aggregate, plyr

How do you aggregate rows to a factor variable with three levels?


I have a dataset in which some participants have multiple rows, and I need to aggregate the data so that every participant has only one row. The dataset contains different variable types (e.g., factors, dates, ages). I have written code that works and looks like this:

example4 <- SMARTdata_50j_diagc_2016  %>% 
  group_by( Patient_Id ) %>%  
  summarise( Groep = first( Groep ),
             Ziekenhuis_Nr = first( Ziekenhuis_Nr ),
             Ziekenhuistype = first( Ziekenhuistype ),
             aantalDBC = n(),
             aantalVervolg = sum( as.numeric( Zorgtype_Code ) ),
             Leeftijd = mean( Lft_patient_openenDBC ),
             MRI_nee_ja = max( ifelse( MRI_nee_ja == 0, 0, 1 ) ),
             aantalMRI = sum( MRI_Aantal ),
             Artroscopie_nee_ja = max( ifelse( Artroscopie_nee_jaz_jam == 0, 0, 1 ) ),
             aantalArtroscopie = sum( Artroscopie_aantal ),
             overigDBC = mean( Aantal_overigeDBC_bijopenen ),
             DBC_open = min( open_DBC ), 
             DBC_sluiten = max( sluiten_DBC ) ) %>% 
  as.data.frame()

This code gives me a single row for each participant. However, I have one more variable that I need to include in the new dataframe, and I do not know how to do that. The variable is called 'Diagnose_Code' and is a factor with two levels: 0 (standing for 1801) and 1 (standing for 1805).

Some participants with multiple rows in the original dataframe have both a 0 and a 1 for that variable. In my new dataframe, I want 'Diagnose_Code' to have three levels: 0 if all rows of that participant are 0, 1 if all rows of that participant are 1, and 2 if the rows of that participant contain both a 0 and a 1.

I do not know how to make this work. I tried using ifelse(), but without success. Does anyone know how I can add this to my code? Thank you in advance!


Solution

  • Using a toy dataset this can be achieved like so:

    library(dplyr)
    
    df <- data.frame(
      id = rep(1:3, each = 3),
      diagnosis_code = c(rep(1,3), rep(0, 3), c(1, 0, 1)),
      stringsAsFactors = FALSE
    )
    df %>% 
      group_by(id) %>% 
      summarise(diagnosis_code = case_when(
        all(diagnosis_code == 1) ~ 1,
        all(diagnosis_code == 0) ~ 0,
        TRUE ~ 2
      ))
    #> # A tibble: 3 x 2
    #>      id diagnosis_code
    #>   <int>          <dbl>
    #> 1     1              1
    #> 2     2              0
    #> 3     3              2
    

    Created on 2020-03-29 by the reprex package (v0.3.0)
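
To connect the toy answer back to the question, here is a minimal sketch of the same `case_when()` rule applied to a factor column named like the asker's, with the result converted to a three-level factor. The data frame `toy` and its values are made up for illustration; only the column names `Patient_Id`, `Diagnose_Code`, and `aantalDBC` come from the question, and the other summaries from the original pipeline are left out for brevity.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data; values are invented.
toy <- data.frame(
  Patient_Id    = c(1, 1, 2, 2, 3, 3),
  Diagnose_Code = factor(c(0, 0, 1, 1, 0, 1), levels = c(0, 1))
)

result <- toy %>%
  group_by(Patient_Id) %>%
  summarise(
    aantalDBC = n(),
    # Compare against the factor labels "0" and "1" rather than numbers.
    Diagnose_Code = case_when(
      all(Diagnose_Code == "0") ~ 0,  # every row of the participant is 0
      all(Diagnose_Code == "1") ~ 1,  # every row of the participant is 1
      TRUE ~ 2                        # a mix of 0 and 1
    )
  ) %>%
  # Turn the numeric code into a factor with the three requested levels.
  mutate(Diagnose_Code = factor(Diagnose_Code, levels = c(0, 1, 2))) %>%
  as.data.frame()

print(result)
```

In the real pipeline, the `Diagnose_Code = case_when(...)` argument would presumably sit alongside the existing arguments inside the single `summarise()` call. Note that with this rule a participant whose rows include a missing `Diagnose_Code` also falls through to 2, so missing values may need separate handling.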