Search code examples
rexcelcsvreadxl

R: Need help converting excel sheets to csv, while properly converting column types & checking .xls or .xlsx extension


I'm fairly new to R & scripting and I'm trying to write a script that will convert each sheet in an excel file to its own csv file. This script will be used on many excel files that won't all have the same number of sheets, number of rows/columns per sheet, and may have columns with empty values. Also, the script will probably be used by other people at work besides myself, so I don't want to use too many packages that they will have to install.

The following is an outline of what I want the script to do:

  1. Take user input for file name
  2. Check whether the file has extension .xls or .xlsx
  3. Append correct extension (.xls or .xlsx) to file name
  4. Read each sheet in the excel file
  5. Keep column data types the same (I explain more later)
  6. Write each sheet to its own csv file

I searched online & tried various methods to get excel sheets converted to csv files, but I didn’t have much luck with most of the methods I found. The one method that got me most of the way there is Yifu Yan's answer here. However, exactly as he mentioned, the column types did not convert correctly.

I tried modifying Yifu Yan’s code a bit, trying to figure out the column type conversion myself, but was unsuccessful. Here is the code I used (with readxl & purrr packages):

file <- readline("Enter file name: ")
# Use if/else, for, or while loop to check file extension, which will replace “.xlsx” in the following line with the determined extension
path <- paste0(file, “.xlsx”)
sheet_names <- readxl::excel_sheets(path)
df <- purrr::map(sheet_names,  ~readxl::read_excel(path, .x, col_types = "guess", col_names = FALSE))
purrr::walk2(df, sheet_names, write.csv(.x, paste0(file, "-", .y, ".csv"), col_names = FALSE))

When I tested the code with an excel file, I found that some of the columns converted fine, but others didn’t. Here is an example of what the data should have looked like after conversion:

Column 1 Column 2 Column 3 Column 4 Column 5
2233.09 501 1234567 ab 3/20/2011 6:09:17
1122.09 502 1234569 ac 3/20/2011 6:12:25

What it actually looked like:

Column 1 Column 2 Column 3 Column 4 Column 5
2233.0900000000001 501 1234567 ab 40622.25644675926
1122.0900000000002 502 1234569 ac 40622.258622685185

The 3 things I want help with:

  1. How to ensure the column types/values stay exactly the same during the conversion process
  2. What function I can use to check the file extension (I can figure out how to write the loop myself if need be)
  3. If possible, use base instead of purrr. I would prefer if the only non-base package is readxl (or xlsx), but its fine if not

EDIT: Since I couldn't get anything else to work while only using one package (besides tools), I will just use readxl & tidyverse (includes readr & purrr). Not ideal, but does what I need it to.


Solution

  • Column Types

    Number 1 is the hardest and I fear you won't find a one fits all solution. Most of the Excel readers I know try really hard to determine the column type. This is not an easy task, because Excel cells contain arbitrary values, and R must find a suitable heuristic to get the proper cell type. This guessing works in my opinion most of the times and even in cases where it doesn't it is often the case that R is right and just the formatting in Excel is misleading.

    So for example your first column looks totally valid to me. R recognizes the number format and reads in the floating point number.

    In floating point arithmetic (cf. Circle 1 - Falling into the Floating Point Trap the two values are indeed the same:

    all.equal(2233.0900000000001, 2233.09)
    # [1] TRUE
    

    For your date problem it is quite similar. Internally in Excel dates are store as numbers (and times as fractions), in the following screenshot both columns A and B contain the same numbers, but column B hast datetime format:

    Screenshot of an Excel file, in Column A you see number 1, 1.5 and 2. In Column B you see a formula taking the same values as Column A. Column B hast datetime formatting showing that Excel uses numbers internally to represent dates and fratcions to show time

    The conversion is a bit tricky and on e needs to know that Excel dates start at 1899-12-30 (for whatever reason)

    xkcd comic mocking the fact that Excel starts dates on the 30th of December 1899

    as.POSIXct(40622.25644675926 * 24 *60*60, origin = "1899-12-30", tz = "GMT")
    # [1] "2011-03-20 06:09:17 GMT"
    

    With all this subtleties and as you are anyways interested in CSV format (which does not know about types other than numbers or text), why not using text for all columns?

    readxl::read_excel(path, .x, col_types = "text")
    

    File Extension

    You can use tools::file_ext. tools is a base library so should be installed anyways, but the implementation is easy enough:

    function (x) {
      pos <- regexpr("\\.([[:alnum:]]+)$", x)
      ifelse(pos > -1L, substring(x, pos + 1L), "")
    }
    

    Thus, you can easily get the file extension then by tools::file_ext(file).

    base Loops

    you can use lapply as a direct "replacement" (actually it is the other way round, purrr::map is a replacement of lapply) for both map and walk:

    df <- lapply(sheet_names,  
                 \(.x) readxl::read_excel(path, .x, col_types = "text", 
                                          col_names = FALSE))
    lapply(names(df), 
           \(.y) sheet_names, write.csv(df[.y], paste0(file, "-", .y, ".csv"), 
                                        col_names = FALSE))