I have a dataset in R with information about individuals and diagnoses. The variables are group, age, weight, id and diagnosis. So an individual can have one row with X in diagnosis (meaning no diagnosis) or one or more rows with diagnoses. Now I want to make a new variable with the number of diagnoses each individual got so that each individual has one row in the dataset with the variables group, age, weight, id and number of diagnoses. In this new column with diagnosis I want individuals with no diagnosis to get the number 0, with one diagnosis the number 1, with two diagnoses the number 2 and etcetera. Can anyone help me?
I am using R. I tried to use group_by and count but I can not get the number 0 for individuals with no diagnosis (X in the diagnosis column) and I can not see the other variables like group, age and weight.
Here is the data:
pr <- read_csv("~/Desktop/Data.csv")
head(pr)
dput(pr)
structure(list(GROUP = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3,
3, 3, 3, 3, 4, 4, 4), AGE = c(23, 34, 61, 23, 45, 34, 34, 55,
56, 43, 56, 49, 61, 49, 74, 49, 51, 46, 75), WEIGHT = c(56, 72,
70, 56, 101, 72, 72, 62, 60, 78, 60, 55, 79, 55, 89, 55, 67,
60, 105), ID = c(4, 1, 2, 4, 3, 1, 1, 5, 7, 6, 7, 8, 9, 8, 10,
8, 11, 12, 13), DIAGNOSIS = c("J01", "J01", "X", "J01", "J01",
"J01", "J01", "J01", "J01", "J01", "J01", "J01", "X", "J01",
"J01", "J01", "X", "J01", "J01")), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -19L), spec = structure(list(
cols = list(GROUP = structure(list(), class = c("collector_double",
"collector")), AGE = structure(list(), class = c("collector_double",
"collector")), WEIGHT = structure(list(), class = c("collector_double",
"collector")), ID = structure(list(), class = c("collector_double",
"collector")), DIAGNOSIS = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
Picture of the desired output: Desired output
One way to approach this is to group_by
multiple columns, if the information is repeated for a given individual (which it does in this example). You will get these columns in your results in the end. Also, you can summarise
where the DIAGNOSIS
is not "X" - instead of count
, so that you will get zero for cases where DIAGNOSIS
is "X".
library(dplyr)
pr %>%
group_by(GROUP, ID, AGE, WEIGHT) %>%
summarise(NUMBER = sum(DIAGNOSIS != "X"))
Output
GROUP ID AGE WEIGHT NUMBER
<dbl> <dbl> <dbl> <dbl> <int>
1 1 1 34 72 3
2 1 2 61 70 0
3 1 3 45 101 1
4 1 4 23 56 2
5 2 5 55 62 1
6 2 6 43 78 1
7 2 7 56 60 2
8 3 8 49 55 3
9 3 9 61 79 0
10 3 10 74 89 1
11 4 11 51 67 0
12 4 12 46 60 1
13 4 13 75 105 1