I'm relatively new to R, so I apologize if this question has already been answered and I was just unable to find it because I don't know what to search for.
I have a very large health-related dataset where there is a row for each doctor's visit that occurred in some specific period of time. One of the variables in the dataset is a patient ID. It's clear that there's a huge amount of duplication of patient IDs in the dataset--I have about half a million unique patient IDs and about 9 million observations. Other variables in the data are diseaseA, diseaseB, and diseaseC. Each of these is a 0 or 1--0 meaning the patient did not have the disease at the time of the visit and 1 meaning they did have the disease. One last important variable is race/ethnicity.
What I ultimately want is:
I'm quite lost as to how to do both of these. Maybe something using group_by and summarize together? I wonder if there's a solution related to the reshape package that would help with the second goal. I'm just not familiar enough with R to get much farther than that, though, so I'd love some help with this.
Here is some sample data with similar characteristics as the actual data:
structure(list(patient_id = c(2L, 1L, 1L, 1L, 4L, 1L, 2L, 5L,
5L, 1L, 2L, 1L, 2L, 4L, 5L, 2L, 4L, 2L, 1L, 3L, 3L, 1L, 2L, 1L,
4L, 5L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 1L, 3L, 3L, 4L,
4L, 1L, 2L, 5L, 5L, 2L, 2L, 2L, 1L, 2L, 2L, 4L, 1L, 3L, 2L, 5L,
4L, 3L, 3L, 1L, 2L, 1L, 2L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 2L, 4L,
4L, 3L, 4L, 5L, 4L, 3L, 4L, 5L, 1L, 5L, 4L, 1L, 3L, 1L, 1L, 3L,
5L, 2L), diseaseA = c(1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1), diseaseB = c(0, 1,
0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,
0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0,
1, 0, 1, 0), diseaseC = c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1,
1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0,
1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1,
0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0), raceeth = structure(c(2L,
2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 3L,
2L, 2L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 2L, 1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 2L,
3L, 3L, 2L, 1L, 2L, 2L, 1L, 3L, 2L), .Label = c("Hispanic", "NH White",
"NH Black"), class = "factor")), row.names = c(NA, 90L), class = "data.frame")
Let me know if there's any additional information I can provide that would help.
I calculated for each patient's diseases (and I think a patient keeps stick to their racceth, that's why grouping by patient_id and raceeth should be ok, I grouped by two variables because I need to keep race column also)
library(dplyr)
df2 <- df %>%
group_by(patient_id,raceeth) %>%
summarise_all(sum) %>%
ungroup
df2
output;
patient_id raceeth diseaseA diseaseB diseaseC
<int> <fct> <dbl> <dbl> <dbl>
1 1 NH White 12 8 11
2 2 NH White 11 11 10
3 3 Hispanic 6 4 7
4 4 NH Black 12 8 8
5 5 NH Black 3 8 6
Now I can calculate averages for each race group ;
df3 <- df2 %>%
select(-patient_id) %>%
group_by(raceeth) %>%
summarise_all(mean)%>%
ungroup
df3
output;
raceeth diseaseA diseaseB diseaseC
<fct> <dbl> <dbl> <dbl>
1 Hispanic 6 4 7
2 NH White 11.5 9.5 10.5
3 NH Black 7.5 8 7