How can I filter a data set according to earliest time point for each participant?

To expand on what I said in the title, I have a data set in excel filled with patient IDs, time of blood draw, and other corresponding values like disease diagnosis, etc.

What I want to do is use R to filter patients by earliest time of blood draw but because the same patient is repeatedly sampled over years there are duplicates. See example below:

Record ID	Study ID	Date of Blood Draw	Experimental value
613	100001	1/4/15	4094.3
613	100001	1/4/16	394.2
613	100001	1/4/17	4949.2
678	100029	4/4/13	494.3
678	100029	4/4/14	293.2
789	100450	5/7/17	34932.3
789	100450	5/10/18	394.3
789	100450	5/10/20	39493.3
789	100450	9/8/21	895.4
789	100450	1/20/23	5883.5
789	100450	1/1/25	495.3
790	100008	9/10/18	4959.3
790	100008	5/10/20	4959.8
801	100078	9/8/21	695.6
900	100082	8/20/23	6979.7
900	100082	1/1/25	7000

There are many, many more columns with different types (numerical and categorical) of data but all I really care about is filtering the dataset by earliest date of blood draw.

Truth be told, I don't know where to start in R. I was looking for packages and related questions on stackoverflow to see if other people have had a similar experience but so far no luck. I did find a package called admiral but I'm still puzzling over its details.

I am hoping the final result looks something like this:

Record ID	Study ID	Date of Blood Draw	Experimental value
613	100001	1/4/15	4094.3
678	100029	4/4/13	494.3
789	100450	5/7/17	34932.3
790	100008	9/10/18	4959.3
801	100078	9/8/21	695.6
900	100082	8/20/23	7000

There is no real rhyme or reason to blood draw dates, especially because the data set I'm using has been built on and added to over a decade or more. So, if anyone knows something or has some advice I would appreciate it a lot!

Solution

You can try something like this (but instead of the data I declared, use yours).

# Load necessary libraries
library(dplyr)
library(readr)
library(lubridate)
library(janitor)  # For cleaning column names

# Read the dataset from CSV and clean column names
data <- read_csv("dataset.csv") %>%
  clean_names()  # Converts "Date of Blood Draw" to "date_of_blood_draw"
    
# Convert the Date of Blood Draw column to Date format
data <- data %>%
  mutate(date_of_blood_draw = mdy(date_of_blood_draw))  # Adjust if format is different

# Select the earliest blood draw date per Study ID
filtered_data <- data %>%
  group_by(study_id) %>%
  slice_min(order_by = date_of_blood_draw, n = 1) %>%
  ungroup()

# View or write the cleaned dataset
print(filtered_data)  # Print to console
write_csv(filtered_data, "filtered_dataset.csv")  # Save as CSV