Search code examples
rregexexceldplyrreadxl

Importing and Tidying from Terrible Excel Spreadsheet in R


First, see this url for a small example of the type of data I am formatting. You'll note I have highlighted two areas in the worksheet that I am trying to select. The selection range needs to be dynamic as projects move in and out of the datasets. For the first part, I think this code will suffice:

library(tidyverse)
library(readxl)

filename <- "MyDataset.xlsx"

#obtain first section of my excel spreadsheet
project_codes <- read_excel(
  path = filename,
  sheet = "Jan18",
  range = "A10:B1000",
  col_names = c("proj_num", "name")
) %>% 
  drop_na() %>% 
  filter(grepl("-", project_codes$proj_num))

The second section is where I am getting tripped up... I want to ensure that I have selected the exact same subset of rows as my 'project_codes' in the other highlighted area of the spreadsheet.

I have many worksheets formatted in this exact same way (and naming convention is consistent - Jan18, Feb18, Mar18), so bonus points if someone can assist me in iterating through the worksheets after solving part 1.


Solution

  • library(tidyverse)
    library(readxl)
    
    filename <- "MyDataset.xlsx"
    
    excel_sheets(filename) %>% 
      set_names() %>% 
      map_df(~ read_excel(
      path = filename,
      sheet = .,
      range = "A6:N1000"
    ) %>% 
      filter(str_detect(`#`, "-")) %>% 
      select(`#`, `PROJECT NAME`, `Status`, `Cumulative Billings`, `Billing Adjustment`, `Contract Value`), .id = "Month")
    
    # A tibble: 8 x 7
      Month `#`       `PROJECT NAME` Status `Cumulative Billings` `Billing Adjustment` `Contract Value`
      <chr> <chr>     <chr>           <dbl>                 <dbl>                <dbl>            <dbl>
    1 Jan18 1-11-0010 Project 1         715                  6723                 2138             1977
    2 Jan18 1-11-0011 Project 2        1717                  8330                 8283             2588
    3 Jan18 1-11-0012 Project 3        3332                  2908                 4938             7734
    4 Jan18 1-11-0013 Project 4        2589                  8714                 6034             1476
    5 Jan18 1-11-0014 Project 5         588                  4969                 2161             3334
    6 Jan18 1-11-0015 Project 6         820                  7688                 4243             4293
    7 Jan18 1-11-0020 Project 20       7287                   333                 9100             3078
    8 Jan18 1-11-0030 Project 30       1564                   487                 7249             5508
    

    map_df() will iterate through each sheet (whose names are extracted using excel_sheets()) and create one data frame with a column indicating which sheet the data comes from.

    I had to rely on this answer to apply the name of the sheet as its own column.