I have multiple Excel files where a key variable for the file is found in cell B2, but the remaining data are in rows and columns below that.
I can extract the "source" variable with the following code:
all_posts <- map_dfr(files, read_xlsx, range = "B2", col_names = "source")
However, I also need to get the additional data in the sheet that is found below B2, with code like so:
all_comments <- map_dfr(files, read_xlsx, skip = 5) #I need to skip five rows to where the data table starts.
How can I do this such that each batch of comments corresponds to the "source" variable found in cell B2? Basically, each Excel file is a "post," named in cell B2, and it contains the corresponding comments for the posts in the rows below that.
Edit: I'm adding a screenshot of one of the Excel files to better help understand my problem.
One way I can think of is using nested tibbles. I am not familiar with read_xlsx
: I prefer read.xlsx
from the openxlsx
package, but I think you get the idea.
library(tidyverse)
tibble(file_name = files) %>%
mutate(post = map_chr(file_name, ~openxlsx::read.xlsx(.x, rows = 2, cols = 2, colNames = FALSE)[1,1],
data = map(file_name, ~openxlsx::read.xlsx(.x, startRow = 6))