
How do I plot a histogram of ages in a tibble with multiple observations per patient?


I have a tibble with one row per observation. The columns hold variables such as ID number, DOB and test results:

d1

ID DOB result
a 1940-01-01 15
a 1940-01-01 17
b 1933-05-20 11
b 1933-05-20 20

I want to make a histogram of the patients' ages, but I can only get the histogram to count every occurrence of the DOB, so I end up with n = patients * observations per patient data points instead of n = patients.

I tried:

ggplot(d1, aes(eeptools::age_calc(dob = as.Date(DOB), enddate = Sys.Date(), units = 'years'))) + geom_histogram(binwidth = 1)

How do I subset so I only get one DOB for each ID? Thanks!


Solution

  • If you are not interested in the result column, you could simply drop it with subset and then use the function distinct to remove all duplicates. I am a bit unsure what you mean by years (is it age in years or year of birth?), but taking years as age as of today, I got this:

    
    # Import packages
    library(ggplot2)
    library(dplyr)
    
    # Make dataframe
    df <- data.frame(ID = c("a", "a", "b", "b"),
           DOB = c("1940-01-01", "1940-01-01", "1933-05-20", "1933-05-20"),
           result = c(15, 17, 11, 20))
    
    
    # Convert DOB to Date class (it most likely already is in your data)
    df %>% mutate(date = as.Date(DOB),
                  years = lubridate::year(date),
                  # Approximate age: current calendar year minus birth year
                  age = lubridate::year(Sys.Date()) - years) %>%
    
    # Subset data to remove results
      subset(select = -result) %>%
    
    # Remove duplicates using distinct
      distinct() %>%
    
    # Plot
      ggplot(aes(x = age)) +
      geom_histogram(bins = 2)
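
A possible variation, as a minimal sketch rather than a definitive answer: de-duplicate on ID directly with distinct(ID, .keep_all = TRUE), which keeps the first row per patient without dropping the result column first, and compute age in completed years with lubridate. The names `patients` and `ref_date` are made up for illustration, and the fixed reference date is an assumption chosen only to make the example reproducible. It also assumes every row of a given ID carries the same DOB.

```r
# Sketch: one row per patient, then a histogram of ages in completed years.
library(dplyr)
library(ggplot2)
library(lubridate)

# Same toy data as in the question
d1 <- tibble(
  ID     = c("a", "a", "b", "b"),
  DOB    = c("1940-01-01", "1940-01-01", "1933-05-20", "1933-05-20"),
  result = c(15, 17, 11, 20)
)

# Fixed reference date (an assumption for reproducibility);
# swap in Sys.Date() to get age as of today
ref_date <- as.Date("2023-06-01")

patients <- d1 %>%
  # Keep the first row for each ID, retaining all other columns
  distinct(ID, .keep_all = TRUE) %>%
  mutate(
    DOB = as.Date(DOB),
    # Completed years between DOB and the reference date
    age = floor(time_length(interval(DOB, ref_date), "years"))
  )

# One bar per year of age, one count per patient
p <- ggplot(patients, aes(x = age)) +
  geom_histogram(binwidth = 1)
```

Printing `p` draws the plot. Because the de-duplication happens before plotting, the same `patients` tibble could also be passed to the eeptools::age_calc call from the question.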