Importing data using readxl

Issue description

I'm trying to iterate over multiple sheets in a spread sheet to pick up the first row as a column name and rows 11+ as data. I'm hoping to import them as a single dataframe. I'm having trouble because there are 10 header rows in the sheets and I don't seem to be able to aggregate the sheets without losing data.

Data

The file in question is found at Table 6 on this page of the ABS website.

My attempt

The first chunk does the heavy lifting of getting the data into r. The map function naturally results in a list of lists, containing data found in the sheets whose name contained the text "Data" (done this way because there are two sheets in every one of these spread sheets that contain some irrelevant info).

BUT I want the output in a dataframe so I tried to use the map_df function but all data from spreadsheets after the first is imported as NA values (incorrect).

library(tidyverse)
library(stringr)

df1 <- path %>% 
  excel_sheets() %>% 
  str_subset("Data") %>%
  map(read_excel, path = path, skip = 9)

The second chunk picks up the column names in each of the sheets so that they can be applied to df1.

nms <- path %>% 
  excel_sheets() %>% 
  str_subset("Data") %>%
  map_df(read_excel, path = path, n_max = 0, col_names = T) %>% 
  mutate(
    date = 1
  ) %>% 
  select(
    date, everything()
  )

names(df1) <- names(nms)

If anyone could show me how to import the data without the NA's in a single dataframe that would be great. Bonus points for showing me how to do it in a single step, without the need for the second chunk to name the columns.

Solution

Not exactly sure what you're looking for, but if you want to read all the sheets in that workbook keeping and skipping the first 9 rows. Then you just need to stitch these all together through a reduce using left_join to get rid of the NA values.

df1 <- path %>% 
  excel_sheets() %>% 
  str_subset("Data") %>%
  map(~read_excel(file, .x, skip = 9)) %>% 
  reduce(left_join, by = "Series ID")

If you want to keep the original header names:

path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map(~read_excel(file, .x, col_names = FALSE) %>%
        set_names(., c("Series ID", .[1, 2:ncol(.)])) %>%
        slice(-1:-10)) %>%
  reduce(left_join, by = "Series ID") %>% 
  mutate_at(vars(-`Series ID`), as.numeric)