Search code examples
rreadxl

Mass import and shaping of data in R


SOF newbie here, so please bear with me and some daft questions!

Can anyone please tell me what the most efficient way would be to read in ~50+ excel files with multiple tabs into R, select a specific tab from all 50, and combine those tabs by date to create a large dataframe?

For example, let's assume I want to see what GP activity has been like in a specific healthcare region here in the UK for the last 4 years. I currently have to download many many monthly reports like this one Appointments in General Practice, March 2022

I then have to go to table 3a, scroll down to the NHS area code "07k", and then paste the activity from that row into a separate excel sheet.

I have to do this a lot, and it takes hours, but am not coding literate enough to figure out a better way of doing this.

From the guidance of others, my attempts to date have involved downloading many months' data to a desktop folder. I set my working drive to that folder, and then attempt the following:

files <- list.files(pattern = "*.xls", full.names = TRUE)

for(i in files){
filepath <-    file.path("C:/Users/myname/Desktop/recordsfolder", i)
assign(i, readxl::read_xls(file path))
}

But the honest truth is I don't really understand this, nor what the next step would be.

If anyone could point me in a better direction I'd be very grateful.


Solution

  • Okay so there is a function readData that reads and filters each of the table 3a's from the files in list.files. You then apply this function to all of the files using map and then bind_rows to reduce the list of datasets to a single dataset.

    I dont know how standardised your reports are so you may need to adjust the read function accordingly

    library(readxl)
    library(tidyverse)
    
    readData <- function(path){
      colnames <- read_excel(path, sheet = "Table 3a", skip = 10, n_max = 1) %>% 
        gather(header1, header2) %>% 
        mutate(across(everything(), ~gsub("\\.\\.\\.\\d+", "", .))) %>% 
        mutate(colname = ifelse(header1=="", header2, paste0(header1, "_", header2))) %>% 
        pull(colname)
      read_excel(path, sheet = "Table 3a", skip = 13, col_names = colnames) %>% 
        filter(grepl("07k", `NHS Area Code`, ignore.case = TRUE))
      }
    
    files <- list.files("C:/Users/myname/Desktop/recordsfolder", full.name = TRUE)
    
    files %>% 
      map(readData) %>% 
      bind_rows()