Search code examples
rmappingtidyversepurrrreadxl

Import multiple xlsx files and add variable from cell, while skipping rows (R)


I have multiple Excel files where a key variable for the file is found in cell B2, but the remaining data are in rows and columns below that.

I can extract the "source" variable with the following code:

all_posts <- map_dfr(files, read_xlsx, range = "B2", col_names = "source")

However, I also need to get the additional data in the sheet that is found below B2, with code like so:

all_comments <- map_dfr(files, read_xlsx, skip = 5) #I need to skip five rows to where the data table starts.

How can I do this such that each batch of comments corresponds to the "source" variable found in cell B2? Basically, each Excel file is a "post," named in cell B2, and it contains the corresponding comments for the posts in the rows below that.

Edit: I'm adding a screenshot of one of the Excel files to better help understand my problem. enter image description here


Solution

  • One way I can think of is using nested tibbles. I am not familiar with read_xlsx: I prefer read.xlsx from the openxlsx package, but I think you get the idea.

    library(tidyverse)
    
    tibble(file_name = files) %>%
      mutate(post = map_chr(file_name, ~openxlsx::read.xlsx(.x, rows = 2, cols = 2, colNames = FALSE)[1,1],
             data = map(file_name, ~openxlsx::read.xlsx(.x, startRow = 6))