Search code examples
rreshapesummary

Count people ever experiencing condition in data with row per doctor's visit


I'm relatively new to R, so I apologize if this question has already been answered and I was just unable to find it because I don't know what to search for.

I have a very large health-related dataset where there is a row for each doctor's visit that occurred in some specific period of time. One of the variables in the dataset is a patient ID. It's clear that there's a huge amount of duplication of patient IDs in the dataset--I have about half a million unique patient IDs and about 9 million observations. Other variables in the data are diseaseA, diseaseB, and diseaseC. Each of these is a 0 or 1--0 meaning the patient did not have the disease at the time of the visit and 1 meaning they did have the disease. One last important variable is race/ethnicity.

What I ultimately want is:

  1. A way to count the number of patients who ever had each disease during a doctor's visit. So, for example, if there is a 1 for diseaseA in any of the observations where patient ID == 1, that patient would be counted as ever having diseaseA.
  2. Among the people who ever had each disease, a way to count how many times they went to the doctor's office with that disease so that I can compare whether the average number differs by the race/ethnicity variable.

I'm quite lost as to how to do both of these. Maybe something using group_by and summarize together? I wonder if there's a solution related to the reshape package that would help with the second goal. I'm just not familiar enough with R to get much farther than that, though, so I'd love some help with this.

Here is some sample data with similar characteristics as the actual data:

structure(list(patient_id = c(2L, 1L, 1L, 1L, 4L, 1L, 2L, 5L, 
5L, 1L, 2L, 1L, 2L, 4L, 5L, 2L, 4L, 2L, 1L, 3L, 3L, 1L, 2L, 1L, 
4L, 5L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 1L, 3L, 3L, 4L, 
4L, 1L, 2L, 5L, 5L, 2L, 2L, 2L, 1L, 2L, 2L, 4L, 1L, 3L, 2L, 5L, 
4L, 3L, 3L, 1L, 2L, 1L, 2L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 2L, 4L, 
4L, 3L, 4L, 5L, 4L, 3L, 4L, 5L, 1L, 5L, 4L, 1L, 3L, 1L, 1L, 3L, 
5L, 2L), diseaseA = c(1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 
1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 
0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1), diseaseB = c(0, 1, 
0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 
0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 
0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 
1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 
1, 0, 1, 0), diseaseC = c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 
1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 
1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 
1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 
0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0), raceeth = structure(c(2L, 
2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 
2L, 2L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 
2L, 2L, 3L, 2L, 1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 
3L, 3L, 2L, 1L, 2L, 2L, 1L, 3L, 2L), .Label = c("Hispanic", "NH White", 
"NH Black"), class = "factor")), row.names = c(NA, 90L), class = "data.frame")

Let me know if there's any additional information I can provide that would help.


Solution

  • I calculated for each patient's diseases (and I think a patient keeps stick to their racceth, that's why grouping by patient_id and raceeth should be ok, I grouped by two variables because I need to keep race column also)

    library(dplyr)
    
    df2 <- df %>%
    group_by(patient_id,raceeth) %>%
    summarise_all(sum) %>%
    ungroup
    
    df2
    

    output;

      patient_id raceeth  diseaseA diseaseB diseaseC
           <int> <fct>       <dbl>    <dbl>    <dbl>
    1          1 NH White       12        8       11
    2          2 NH White       11       11       10
    3          3 Hispanic        6        4        7
    4          4 NH Black       12        8        8
    5          5 NH Black        3        8        6
    

    Now I can calculate averages for each race group ;

    df3 <- df2 %>%
    select(-patient_id) %>%
    group_by(raceeth) %>%
    summarise_all(mean)%>%
    ungroup
    
    df3
    

    output;

      raceeth  diseaseA diseaseB diseaseC
      <fct>       <dbl>    <dbl>    <dbl>
    1 Hispanic      6        4        7  
    2 NH White     11.5      9.5     10.5
    3 NH Black      7.5      8        7