Tags: r, gzip, read.table

Trying to import a .gz file and encountering errors


I am downloading the data from the following site (https://www.dol.gov/agencies/eta/performance/results).

From there, I scroll down to PY 2022, Q2, click on WIOA Individual Performance Records, and download the 562 MB file.

In R, I use the following code:

   library(data.table)
   library(tidyverse)
   library(stringr)
   library(lubridate)


   setwd("file/path")
   WIOA <- read.table(gzfile("WIOA.gz", ""))

and I get the following error:

  Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
    line 1 did not have 5 elements

I know that this error has been discussed in past questions. However, I can't figure out how to look at the data before importing it. That would help, because I could then specify the layout in the import script. Any thoughts on how to get around this?


Solution

  • You can take a look at part of the data using the option 'nrows'; for example, you can start by reading only the first 2 rows:

    WIOA <- read.table(gzfile("WIOA.gz", ""), nrows=2)
    

    Looking at those 2 lines, you can see that the delimiter in this case is a comma and that the first line looks like a header, so you can use that information to read the whole file. I would also recommend using 'fill=TRUE' to pad rows of different lengths with blanks, and 'quote = ""' to avoid out-of-memory issues (see: https://stackoverflow.com/a/42842260/10932159):

    WIOA <- read.table(gzfile("WIOA.gz", ""), header = TRUE, sep = ",", quote = "", fill=TRUE)
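
If even a two-row read fails, the raw text can be inspected without parsing it at all. The following is a minimal base R sketch and not part of the original answer: it writes a tiny made-up gzipped CSV to a temporary file (a hypothetical stand-in for WIOA.gz), prints the first raw lines with readLines, and counts comma-separated fields per line with count.fields to check whether the rows are ragged.

```r
# Hypothetical stand-in for the real download: a tiny gzipped CSV in a temp file.
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "w")
writeLines(c("id,state,wage", "1,OH,1200", "2,TX,1500", "3,CA,900"), con)
close(con)

# Peek at the first raw lines without parsing them into columns.
con <- gzfile(tmp, "r")
first_lines <- readLines(con, n = 2)
close(con)
print(first_lines)

# Count comma-separated fields on every line (header included).
con <- gzfile(tmp, "r")
n_fields <- count.fields(con, sep = ",", quote = "")
close(con)
print(table(n_fields))
```

For the real file, replace tmp with "WIOA.gz". A single value in the table suggests every row has the same number of fields; several values suggest the rows are ragged and that fill=TRUE is worth setting. Counting fields over the whole 562 MB file may take a while.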
    

    If you are willing to use another package to read the data, vroom is much faster than read.table (roughly 4 times faster):

    library(vroom)
    WIOA <- vroom("WIOA.gz")
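
The speed difference depends on the file and the machine, so it may be worth measuring it yourself. Below is a rough sketch, assuming the vroom package is installed (the comparison is skipped otherwise) and using a small made-up gzipped CSV in place of the real download. Setting altrep = FALSE makes vroom read everything up front rather than lazily, which I assume gives a fairer comparison.

```r
# Made-up gzipped CSV standing in for the real download (hypothetical data).
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "w")
write.csv(data.frame(id = 1:5000, wage = rep(c(900, 1200), 2500)),
          con, row.names = FALSE, quote = FALSE)
close(con)

# Time the base R reader, treating the file as comma-separated.
t_base <- system.time(
  base_df <- read.table(gzfile(tmp), header = TRUE, sep = ",",
                        quote = "", fill = TRUE)
)

# Time vroom only if it is available; altrep = FALSE forces a full read.
if (requireNamespace("vroom", quietly = TRUE)) {
  t_vroom <- system.time(
    vroom_df <- vroom::vroom(tmp, delim = ",", altrep = FALSE,
                             progress = FALSE, show_col_types = FALSE)
  )
  print(c(read.table = t_base[["elapsed"]], vroom = t_vroom[["elapsed"]]))
}
```

On a file this small the timings are dominated by overhead, so point tmp at the real file to get a meaningful number.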