Tags: julia, mmap

reading large .csv file mmap


I am trying to open a 3.54 GB file:

using Mmap

s = open("C:/Users/a.bannerman/Desktop/code/TS_data/big.txt", "r")
a = Mmap.mmap(s)

This is what is returned:

3802655667-element Vector{UInt8}:
 0x30
 0x31
 0x2f
 0x30
 0x32
 0x2f
 0x32
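Each element of the mapped vector is one raw byte of the file, and the bytes shown above are just ASCII text. A quick check (the byte values are taken from the output above):

```julia
# 0x30 is '0', 0x2f is '/', etc. -- the first bytes of the file spell out text.
bytes = UInt8[0x30, 0x31, 0x2f, 0x30, 0x32, 0x2f, 0x32]
println(String(copy(bytes)))  # prints "01/02/2" -- the start of a date field
```

(`copy` is used because `String(v)` consumes the byte vector it is given.)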

Now I am trying to process this file line by line. I can subset it:


1000-element Vector{UInt8}:
 0x30
 0x31
 0x2f
 0x30
 0x32
 0x2f
 0x32
    ⋮
 0x0a

How can I resolve the data at these memory addresses (does each element point to a single row of the .txt file?)? Granted, on this machine I would run out of memory if I resolved the whole thing into a matrix or DataFrame at once. Instead, I'd like to iterate over the mapped bytes, extract a batch of rows, populate a matrix or DataFrame that I build, save each block as a .csv, free the memory, and then do the next batch. The data itself has n rows and about 5 columns.
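To answer the "does each address point to a row?" part: no, each element is one byte, and rows are delimited by `0x0a` (`'\n'`) bytes. One way to walk the map row by row, sketched with a placeholder path, is to scan for newlines and decode one slice at a time:

```julia
using Mmap

# Sketch: find each 0x0a byte and decode the slice before it as one row of text.
# "big.txt" is a placeholder for the real path.
open("big.txt", "r") do s
    a = Mmap.mmap(s)
    start = 1
    while start <= length(a)
        # position of the next newline, or one past the end for the last row
        stop = something(findnext(==(0x0a), a, start), length(a) + 1)
        line = String(copy(@view a[start:stop-1]))  # one row as a String
        # split(line, ',') would then give the ~5 column values
        start = stop + 1
    end
end
```

This only ever materializes one line of text at a time, so memory use stays flat regardless of file size.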

I'm still curious about the above, but here is an answer using CSV.jl:

using CSV, DataFrames

row_size = 10000
for rows in Iterators.partition(CSV.Rows("C:/Users/a.bannerman/Desktop/code/TS_data/big.txt"), row_size)
    df = DataFrame(rows) # resolve this chunk into a DataFrame
    # perform operations on this specific chunk of the file
end
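The batch-to-disk workflow described in the question drops straight into this loop. A sketch, with placeholder file names:

```julia
using CSV, DataFrames

# Sketch: read the big file in batches and land each batch in its own .csv.
# "big.txt" and the "chunk_i.csv" names are placeholders.
row_size = 10_000
for (i, rows) in enumerate(Iterators.partition(CSV.Rows("big.txt"), row_size))
    df = DataFrame(rows)             # resolve only this batch of rows
    CSV.write("chunk_$(i).csv", df)  # save the block, then let df be freed
end
```

Because `df` is rebuilt on every iteration, peak memory stays at roughly `row_size` rows rather than the whole file.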

Solution

  • CSV.jl does the mmap for you

    From your post it seems that you want to process a huge CSV file row by row. This can be done using CSV.Rows instead of CSV.File.

    julia> @time CSV.Rows("huge_huge_file.csv")
      0.000654 seconds (1.39 k allocations: 37.500 KiB)
    CSV.Rows("huge_huge_file.csv"):
    Size: 10
    Tables.Schema:
     :elemtype  Union{Missing, PosLenString}
     :elemid    Union{Missing, PosLenString}
     ...
    

    Now, having a set of rows, you can iterate over it like any other iterator.

    Consider this code:

    using CSV, DataFrames

    df = DataFrame()
    for row in CSV.Rows("huge_huge_file.csv")
        push!(df, row)
        nrow(df) > 5 && break
    end
    

    The CSV.jl docs read:

    CSV.Rows: an alternative approach for consuming delimited data, where the input is only consumed one row at a time, which allows "streaming" the data with a lower memory footprint than CSV.File. Supports many of the same options as CSV.File, except column type handling is a little different. By default, every column type will be essentially Union{Missing, String}, i.e. no automatic type detection is done, but column types can be provided manually. Multithreading is not used while parsing. After constructing a CSV.Rows object, rows can be "streamed" by iterating, where each iteration produces a CSV.Row2 object, which operates similar to CSV.File's CSV.Row type where individual row values can be accessed via row.col1, row[:col1], or row[1]. If each row is processed individually, additional memory can be saved by passing reusebuffer=true, which means a single buffer will be allocated to hold the values of only the currently iterated row. CSV.Rows also supports the Tables.jl interface and can also be passed to valid sink functions.
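    The `reusebuffer=true` option mentioned at the end of that quote looks like this in practice (a sketch; the file name is the one from the answer above, and the column index is illustrative):

    ```julia
    using CSV

    # One buffer is allocated and overwritten on each iteration, so copy out
    # any value you want to keep before moving to the next row.
    for row in CSV.Rows("huge_huge_file.csv"; reusebuffer=true)
        v = String(row[1])  # also accessible as row.col1 or row[:col1]
        # ... process v ...
    end
    ```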