Search code examples
rdataframedata-cleaning

How do I combine row entries for the same patient ID# in R while keeping other columns and NA values?


I need to combine some of the columns for these multiple IDs and can just use the values from the first ID listing for the others. For example here I just want to combine the "spending" column as well as the heart attack column to just say whether they ever had a heart attack. I then want to delete the duplicate ID#s and just keep the values from the first listing for the other columns:

df <- read.table(text = 
              "ID  Age   Gender  heartattack    spending 
               1 24 f 0 140
               2 24 m na 123
               2 24 m 1 58
               2 24 m 0 na
               3 85 f 1 170
               4 45 m na 204", header=TRUE)

What I need:

df2 <- read.table(text = 
            "ID  Age   Gender  ever_heartattack all_spending 
             1 24 f 0 140
             2 24 m 1 181
             3 85 f 1 170
             4 45 m na 204", header=TRUE)

I tried group_by with transmute() and sum() as follows:

df$heartattack = as.numeric(as.character(df$heartattack))
df$spending  = as.numeric(as.character(df$spending))

library(dplyr)
df = df %>% group_by(ID) %>% transmute(ever_heartattack = sum(heartattack, na.rm = T), all_spending = sum(spending, na.rm=T))

But this removes all the other columns! Also it turns NA values into zeros...for example I still want "NA" to be the value for patient ID#4, I don't want to change the data to say they never had a heart attack!

> print(dfa) #This doesn't at all match df2 :(
  ID ever_heartattack all_spending
1  1                0          140
2  2                1          181
3  2                1          181
4  2                1          181
5  3                1          170
6  4                0          204

Solution

  • Could you do this?

    aggregate(
        spending ~ ID + Age + Gender, 
        data = transform(df, spending = as.numeric(as.character(spending))),
        FUN = sum)
    #  ID Age Gender spending
    #1  1  24      f      140
    #2  3  85      f      170
    #3  2  24      m      181
    #4  4  45      m      204
    

    Some comments:

    1. The thing is that when aggregating you don't give clear rules how to deal with data in additional columns that differ (like heartattack in this case). For example, for ID = 2 why do you retain heartattack = 1 instead of heartattack = na or heartattack = 0?

    2. Your "na"s are in fact not real NAs. That leads to spending being a factor column instead of a numeric column vector.


    To exactly reproduce your expected output one can do

    df %>%
        mutate(
            heartattack = as.numeric(as.character(heartattack)),
            spending = as.numeric(as.character(spending))) %>%
        group_by(ID, Age, Gender) %>%
        summarise(
            heartattack = ifelse(
                any(heartattack %in% c(0, 1)),
                max(heartattack, na.rm = T),
                NA),
            spending = sum(spending, na.rm = T))
    ## A tibble: 4 x 5
    ## Groups:   ID, Age [?]
    #     ID   Age Gender heartattack spending
    #  <int> <int> <fct>        <dbl>    <dbl>
    #1     1    24 f                0      140
    #2     2    24 m                1      181
    #3     3    85 f                1      170
    #4     4    45 m               NA      204
    

    This feels a bit "hacky" on account of the rules not being clear which heartattack value to keep. In this case we

    • keep the maximum value of heartattack if heartattack contains either 0 or 1.
    • return NA if heartattack does not contain 0 or 1.