
Importing a CSV text file inside a zip file from an FTP URL causes a BoundsError


using HTTP, ZipFile, CSV
datafile="ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/natality/Nat2018ps.zip"

function rzip(datafile)
    dat = HTTP.get(datafile)
    r = ZipFile.Reader(IOBuffer(dat.body))
    f  = r.files[1]
    CSV.read(f, delim=' ', ignorerepeated=true) 
end

The function rzip downloads the zip file, opens the text file inside it, and uses CSV to read that file into a DataFrame.

Running it produces the following error:

julia> rzip(datafile)
ERROR: BoundsError: attempt to access 16-element Array{UInt64,1} at index [-9223372036854775807]

Solution

  • The ZipFile.Reader stream is not a random-access stream, so it does not work correctly with the multi-threaded parsing that was recently introduced in CSV.jl. Hence you need to pass the threaded=false option.

    using HTTP, ZipFile, CSV
    datafile="ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/natality/Nat2018ps.zip"
    dat = HTTP.get(datafile)
    r = ZipFile.Reader(IOBuffer(dat.body))
    f  = r.files[1]
    df = CSV.read(f, delim=' ', ignorerepeated=true, threaded=false) 
    

    Now just to show that it works:

    julia> df
    25918×56 DataFrames.DataFrame. Omitted printing of 51 columns  
    │ Row   │ 201801 │ 04272GU │ 010311 │ 1     │ 20083US │
    │       │ Int64  │ String  │ Int64  │ Int64 │ String  │
    ├───────┼────────┼─────────┼────────┼───────┼─────────┤
    │ 1     │ 201801 │ 05592GU │ 10311  │ 1     │ 35116FM │
    │ 2     │ 201801 │ 11362GU │ 10311  │ 1     │ 22083US │        
    ⋮
    │ 25916 │ 201808 │ 01001PR │ 31371  │ 2     │ 22083US │        
    │ 25917 │ 201811 │ 00495PR │ 21311  │ 1     │ 19072US │        
    │ 25918 │ 201806 │ 10221PR │ 127211 │ 1     │ 19072US │
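
Since the problem is that the zip entry is not a random-access stream, another possible workaround is to decompress the entry fully into memory first and hand CSV an in-memory buffer instead. The following is only a minimal sketch of that idea, not a tested replacement for the fix above. The helper name read_first_entry, the file name sample.txt, and the two sample rows are assumptions made for illustration; the sketch builds a small local archive so that it runs without network access. It uses header=false because the toy data has no header row. As far as I know, newer CSV.jl releases replaced the threaded keyword with ntasks (so ntasks=1 would be the equivalent there), which is another reason the sketch avoids that keyword.

```julia
using ZipFile, CSV

# Hypothetical helper: read the first entry of a zip archive into memory,
# then parse the in-memory text with CSV.File.
function read_first_entry(zip_path::AbstractString; kwargs...)
    r = ZipFile.Reader(zip_path)
    try
        f = r.files[1]
        text = read(f, String)   # decompress the whole entry up front
        return CSV.File(IOBuffer(text); kwargs...)
    finally
        close(r)                 # release the archive handle in all cases
    end
end

# Build a tiny sample archive (illustrative rows only) in a temporary file.
sample_zip = tempname() * ".zip"
w = ZipFile.Writer(sample_zip)
entry = ZipFile.addfile(w, "sample.txt")
write(entry, "201801 05592GU  10311\n201801 11362GU  10311\n")
close(w)

# Space-delimited, repeated delimiters collapsed, no header row in the toy data.
rows = read_first_entry(sample_zip; delim=' ', ignorerepeated=true, header=false)
```

The trade-off is memory: the whole decompressed entry is held as a String before parsing, which is fine for files of this size but may not be for very large archives.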