Tags: r, string, tidyverse, purrr, data-import

Importing data and adding a source-specific ID to each file


I have one data frame that maps each patient_id to the patient's name.

Each patient has their own data file, FirstNameLastName.csv. To anonymize the data, I wrote the function read_in, which reads in one FirstNameLastName.csv file and adds the specified patient_id to it.

For further analysis I now want all of the anonymized data in a single data frame. I tried this with the map_df() function from the purrr package, but I am having trouble matching the ID to each .csv file that is read in. Could somebody help me fix this, so that the result is one data frame containing all the data with the respective ID?

> patient_names
  patient_id        patient_name  
1      1            Tina Turner
2      2            Michael Jackson 
3      3            Michael Jordan  
4      4            Dom Toretto
5      5            Lebron James

read_csv("LebronJames.csv")

Year         Injury                  
<chr>        <chr>                
2020       Sprained Ankle             
1990       Torn ACL       
1995       Bruised Knee       
2011       Sore Neck  
2014       Headache 
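2019       Broken Leg

read_in <- function(path, patient_id = 1){
  # Read one semicolon-delimited patient file
  data <- read_delim(path, delim = ";", col_names = TRUE)
  # Prepend the ID stored at position `patient_id` of patient_names
  data <- add_column(data, patient_id = patient_names[["patient_id"]][patient_id], .before = 1)
  # Return the data frame visibly
  data
}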

  patient_id       Year         Injury                  
       <int>       <chr>        <chr>                
 1      5          2020       Sprained Ankle             
 2      5          1990       Torn ACL       
 3      5          1995       Bruised Knee       
 4      5          2011       Sore Neck  
 5      5          2014       Headache 
 6      5          2019       Broken Leg

# Only the path is passed to read_in here, so no per-file ID is supplied
list.files(path = "/directory", pattern = "\\.csv$", full.names = TRUE) %>%
  map_df(read_in)

# A tibble: 1234 x 3
    patient_id   Year    Injury
    <int>        <chr>   <chr>        
 1      1        2012    Ankle   
 2      1        2014    Broken Arm 
 3      1        1999    Concussion 
 4      1        1987    Broken Finger
...    ...       ...     ...

Solution

  • Try this approach:

    library(purrr)
    library(readr)
    
    filenames <- paste0(gsub('\\s', '', patient_names$patient_name), '.csv')
    data <- map_df(filenames, read_csv, .id = 'patient_id')
    

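    Here, filenames is a character vector of the files to read, and data combines the contents of all those CSV files, with a 'patient_id' column identifying the file each row came from. Because filenames is unnamed, .id stores the position of each file in that vector (as a character string), so it equals the real patient_id only when the IDs are 1, 2, 3, ... in the same row order as patient_names.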
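
    If the IDs are not simply 1, 2, 3, ... in row order, one option is to pass each ID explicitly alongside its file path. The following is only a minimal sketch under stated assumptions: the two-row patient_names table, the temporary directory data_dir, the demo files written into it, and the helper name read_with_id are all hypothetical and exist only to make the example runnable. It assumes semicolon-delimited files, as in the question's read_in.

```r
library(purrr)
library(readr)
library(dplyr)
library(tibble)

# Hypothetical lookup table (two patients, deliberately not in ID order)
patient_names <- tibble(
  patient_id   = c(5L, 1L),
  patient_name = c("Lebron James", "Tina Turner")
)

# Hypothetical demo files in a temporary directory, semicolon-delimited
data_dir <- tempfile("patients_")
dir.create(data_dir)
write_delim(
  tibble(Year = c("2020", "1990"), Injury = c("Sprained Ankle", "Torn ACL")),
  file.path(data_dir, "LebronJames.csv"), delim = ";"
)
write_delim(
  tibble(Year = c("2012", "2014"), Injury = c("Ankle", "Broken Arm")),
  file.path(data_dir, "TinaTurner.csv"), delim = ";"
)

# Build one path per row of patient_names, so path i belongs to ID i
paths <- file.path(
  data_dir,
  paste0(gsub("\\s", "", patient_names$patient_name), ".csv")
)

# Read a single file and prepend the ID that is passed in explicitly
read_with_id <- function(path, id) {
  read_delim(path, delim = ";", col_types = cols(.default = col_character())) %>%
    add_column(patient_id = id, .before = 1)
}

# Walk over paths and IDs in parallel, then stack the per-file results
all_data <- map2(paths, patient_names$patient_id, read_with_id) %>%
  bind_rows()
```

    Because the paths are derived from patient_names rather than from list.files(), each file stays paired with its own row of the lookup table, regardless of how the files sort on disk.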