Search code examples
rexceldatabase

How can I filter a data set according to earliest time point for each participant?


To expand on what I said in the title, I have a data set in excel filled with patient IDs, time of blood draw, and other corresponding values like disease diagnosis, etc.

What I want to do is use R to filter patients by earliest time of blood draw but because the same patient is repeatedly sampled over years there are duplicates. See example below:

Record ID Study ID Date of Blood Draw Experimental value
613 100001 1/4/15 4094.3
613 100001 1/4/16 394.2
613 100001 1/4/17 4949.2
678 100029 4/4/13 494.3
678 100029 4/4/14 293.2
789 100450 5/7/17 34932.3
789 100450 5/10/18 394.3
789 100450 5/10/20 39493.3
789 100450 9/8/21 895.4
789 100450 1/20/23 5883.5
789 100450 1/1/25 495.3
790 100008 9/10/18 4959.3
790 100008 5/10/20 4959.8
801 100078 9/8/21 695.6
900 100082 8/20/23 6979.7
900 100082 1/1/25 7000

There are many, many more columns with different types (numerical and categorical) of data but all I really care about is filtering the dataset by earliest date of blood draw.

Truth be told, I don't know where to start in R. I was looking for packages and related questions on stackoverflow to see if other people have had a similar experience but so far no luck. I did find a package called admiral but I'm still puzzling over its details.

I am hoping the final result looks something like this:

Record ID Study ID Date of Blood Draw Experimental value
613 100001 1/4/15 4094.3
678 100029 4/4/13 494.3
789 100450 5/7/17 34932.3
790 100008 9/10/18 4959.3
801 100078 9/8/21 695.6
900 100082 8/20/23 7000

There is no real rhyme or reason to blood draw dates, especially because the data set I'm using has been built on and added to over a decade or more. So, if anyone knows something or has some advice I would appreciate it a lot!


Solution

  • You can try something like this (but instead of the data I declared, use yours).

    # Load necessary libraries
    library(dplyr)
    library(readr)
    library(lubridate)
    library(janitor)  # For cleaning column names
    
    # Read the dataset from CSV and clean column names
    data <- read_csv("dataset.csv") %>%
      clean_names()  # Converts "Date of Blood Draw" to "date_of_blood_draw"
        
    # Convert the Date of Blood Draw column to Date format
    data <- data %>%
      mutate(date_of_blood_draw = mdy(date_of_blood_draw))  # Adjust if format is different
    
    # Select the earliest blood draw date per Study ID
    filtered_data <- data %>%
      group_by(study_id) %>%
      slice_min(order_by = date_of_blood_draw, n = 1) %>%
      ungroup()
    
    # View or write the cleaned dataset
    print(filtered_data)  # Print to console
    write_csv(filtered_data, "filtered_dataset.csv")  # Save as CSV