Search code examples
rcounttidyversecalculated-columns

How can I create a new column in R with a count of observations in another column in the dataset?


I have a dataset in R with information about individuals and diagnoses. The variables are group, age, weight, id and diagnosis. So an individual can have one row with X in diagnosis (meaning no diagnosis) or one or more rows with diagnoses. Now I want to make a new variable with the number of diagnoses each individual got so that each individual has one row in the dataset with the variables group, age, weight, id and number of diagnoses. In this new column with diagnosis I want individuals with no diagnosis to get the number 0, with one diagnosis the number 1, with two diagnoses the number 2 and etcetera. Can anyone help me?

I am using R. I tried to use group_by and count but I can not get the number 0 for individuals with no diagnosis (X in the diagnosis column) and I can not see the other variables like group, age and weight.

Here is the data:

pr <- read_csv("~/Desktop/Data.csv")
head(pr)

Data

dput(pr)

structure(list(GROUP = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 
3, 3, 3, 3, 4, 4, 4), AGE = c(23, 34, 61, 23, 45, 34, 34, 55, 
56, 43, 56, 49, 61, 49, 74, 49, 51, 46, 75), WEIGHT = c(56, 72, 
70, 56, 101, 72, 72, 62, 60, 78, 60, 55, 79, 55, 89, 55, 67, 
60, 105), ID = c(4, 1, 2, 4, 3, 1, 1, 5, 7, 6, 7, 8, 9, 8, 10, 
8, 11, 12, 13), DIAGNOSIS = c("J01", "J01", "X", "J01", "J01", 
"J01", "J01", "J01", "J01", "J01", "J01", "J01", "X", "J01", 
"J01", "J01", "X", "J01", "J01")), class = c("spec_tbl_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -19L), spec = structure(list(
    cols = list(GROUP = structure(list(), class = c("collector_double", 
    "collector")), AGE = structure(list(), class = c("collector_double", 
    "collector")), WEIGHT = structure(list(), class = c("collector_double", 
    "collector")), ID = structure(list(), class = c("collector_double", 
    "collector")), DIAGNOSIS = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

Picture of the desired output: Desired output


Solution

  • One way to approach this is to group_by multiple columns, if the information is repeated for a given individual (which it does in this example). You will get these columns in your results in the end. Also, you can summarise where the DIAGNOSIS is not "X" - instead of count, so that you will get zero for cases where DIAGNOSIS is "X".

    library(dplyr)
    
    pr %>%
      group_by(GROUP, ID, AGE, WEIGHT) %>%
      summarise(NUMBER = sum(DIAGNOSIS != "X"))
    

    Output

       GROUP    ID   AGE WEIGHT NUMBER
       <dbl> <dbl> <dbl>  <dbl>  <int>
     1     1     1    34     72      3
     2     1     2    61     70      0
     3     1     3    45    101      1
     4     1     4    23     56      2
     5     2     5    55     62      1
     6     2     6    43     78      1
     7     2     7    56     60      2
     8     3     8    49     55      3
     9     3     9    61     79      0
    10     3    10    74     89      1
    11     4    11    51     67      0
    12     4    12    46     60      1
    13     4    13    75    105      1