To expand on what I said in the title, I have a data set in excel filled with patient IDs, time of blood draw, and other corresponding values like disease diagnosis, etc.
What I want to do is use R
to filter patients by earliest time of blood draw but because the same patient is repeatedly sampled over years there are duplicates. See example below:
Record ID | Study ID | Date of Blood Draw | Experimental value |
---|---|---|---|
613 | 100001 | 1/4/15 | 4094.3 |
613 | 100001 | 1/4/16 | 394.2 |
613 | 100001 | 1/4/17 | 4949.2 |
678 | 100029 | 4/4/13 | 494.3 |
678 | 100029 | 4/4/14 | 293.2 |
789 | 100450 | 5/7/17 | 34932.3 |
789 | 100450 | 5/10/18 | 394.3 |
789 | 100450 | 5/10/20 | 39493.3 |
789 | 100450 | 9/8/21 | 895.4 |
789 | 100450 | 1/20/23 | 5883.5 |
789 | 100450 | 1/1/25 | 495.3 |
790 | 100008 | 9/10/18 | 4959.3 |
790 | 100008 | 5/10/20 | 4959.8 |
801 | 100078 | 9/8/21 | 695.6 |
900 | 100082 | 8/20/23 | 6979.7 |
900 | 100082 | 1/1/25 | 7000 |
There are many, many more columns with different types (numerical and categorical) of data but all I really care about is filtering the dataset by earliest date of blood draw.
Truth be told, I don't know where to start in R
. I was looking for packages and related questions on stackoverflow to see if other people have had a similar experience but so far no luck. I did find a package called admiral
but I'm still puzzling over its details.
I am hoping the final result looks something like this:
Record ID | Study ID | Date of Blood Draw | Experimental value |
---|---|---|---|
613 | 100001 | 1/4/15 | 4094.3 |
678 | 100029 | 4/4/13 | 494.3 |
789 | 100450 | 5/7/17 | 34932.3 |
790 | 100008 | 9/10/18 | 4959.3 |
801 | 100078 | 9/8/21 | 695.6 |
900 | 100082 | 8/20/23 | 7000 |
There is no real rhyme or reason to blood draw dates, especially because the data set I'm using has been built on and added to over a decade or more. So, if anyone knows something or has some advice I would appreciate it a lot!
You can try something like this (but instead of the data I declared, use yours).
# Load necessary libraries
library(dplyr)
library(readr)
library(lubridate)
library(janitor) # For cleaning column names
# Read the dataset from CSV and clean column names
data <- read_csv("dataset.csv") %>%
clean_names() # Converts "Date of Blood Draw" to "date_of_blood_draw"
# Convert the Date of Blood Draw column to Date format
data <- data %>%
mutate(date_of_blood_draw = mdy(date_of_blood_draw)) # Adjust if format is different
# Select the earliest blood draw date per Study ID
filtered_data <- data %>%
group_by(study_id) %>%
slice_min(order_by = date_of_blood_draw, n = 1) %>%
ungroup()
# View or write the cleaned dataset
print(filtered_data) # Print to console
write_csv(filtered_data, "filtered_dataset.csv") # Save as CSV