I am downloading the data from the following site (https://www.dol.gov/agencies/eta/performance/results).
From there, I scroll down to PY 2022, Q2, click on WIOA Individual Performance Records, and download the 562 MB file.
In R, I use the following code,
library(data.table)
library(tidyverse)
library(stringr)
library(lubridate)
setwd("file/path")
WIOA <- read.table(gzfile("WIOA.gz", ""))
and I get the following error
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 5 elements
I know that this error has been discussed in past questions. However, I can't figure out how to actually look at the data before importing. This would help because then I could give a layout in the import script. Any thoughts on how to get around this?
You can take a look at part of the data using the option 'nrows'; for example, you can start by reading only the first 2 rows:
WIOA <- read.table(gzfile("WIOA.gz", ""), nrows = 2)
Looking at those 2 lines, you can see that the delimiter used in this case is a comma and that the first line looks like a header, so you can use that information to read the whole file. I would also recommend using 'fill = TRUE' to pad rows with blanks when they have different lengths, and 'quote = ""' to avoid out-of-memory issues (see: https://stackoverflow.com/a/42842260/10932159):
WIOA <- read.table(gzfile("WIOA.gz", ""), header = TRUE, sep = ",", quote = "", fill = TRUE)
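If read.table still errors out before you can see anything, a rough alternative is to look at the raw text with readLines and count the fields per line with count.fields, both from base R. The sketch below builds a tiny gzip file with made-up columns (id, state, wage) as a stand-in for WIOA.gz, so the file contents and names are assumptions, not the real layout:

```r
# Build a tiny gzip-compressed CSV as a stand-in for WIOA.gz.
# The columns and values here are invented for illustration only.
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "w")
writeLines(c("id,state,wage", "1,TX,1200", "2,CA,1500", "3,NY,900"), con)
close(con)

# Peek at the first two raw lines without parsing them into columns.
con <- gzfile(tmp, "r")
first_lines <- readLines(con, n = 2)
close(con)
print(first_lines)

# Count comma-separated fields on every line to spot ragged rows.
# count.fields opens and closes the connection itself.
n_fields <- count.fields(gzfile(tmp), sep = ",", quote = "")
print(table(n_fields))
```

With the real file you would replace tmp with "WIOA.gz". If table(n_fields) shows more than one value, the rows are ragged and 'fill = TRUE' is probably needed.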
If you are willing to use another library to read the data, vroom is way faster than read.table (roughly 4 times faster):
library(vroom)
WIOA <- vroom("WIOA.gz")
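vroom can also be used just to preview the file: as far as I know, 'n_max' limits how many rows are read and 'delim' sets the delimiter explicitly instead of letting vroom guess it. A minimal sketch, again using a small made-up gzip file in place of WIOA.gz (the column names are hypothetical):

```r
library(vroom)

# Tiny gzip-compressed stand-in for WIOA.gz; contents are invented.
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "w")
writeLines(c("id,state,wage", "1,TX,1200", "2,CA,1500", "3,NY,900"), con)
close(con)

# Read only the first 2 data rows, with the delimiter given explicitly.
preview <- vroom(tmp, delim = ",", n_max = 2, show_col_types = FALSE)
print(preview)

# Show the column types vroom guessed, which can then be passed to
# col_types when reading the full file.
print(spec(preview))
```

Once the preview looks right, the same call without 'n_max' reads the whole file.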