I am trying to open a 3.54 GB file:

using Mmap

s = open("C:/Users/a.bannerman/Desktop/code/TS_data/big.txt", "r")
a = Mmap.mmap(s)
This is what is returned:
3802655667-element Vector{UInt8}:
0x30
0x31
0x2f
0x30
0x32
0x2f
0x32
⋮
Now I am trying to process this file line by line. I can even subset it, e.g. take the first 1000 bytes:
1000-element Vector{UInt8}:
0x30
0x31
0x2f
0x30
0x32
0x2f
0x32
⋮
0x0a
How can I resolve the data at these memory addresses (does each memory address point to a single row in the .txt file?)? Granted, on this machine, if I resolved it all into a matrix/DataFrame at once I would run out of memory. At this point I'd like to iterate over the memory addresses, extract a single row, populate a matrix/DataFrame I build, save each block as a .csv, close it, free the memory, then do the next batch. The data itself has n rows and about 5 columns.
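My understanding is that each element of the mmapped Vector{UInt8} is a single byte, not a row, and that rows end at the newline byte 0x0a. A minimal, untested sketch of the batching idea along those lines (the function name and batch size are made up):

using Mmap

function process_in_batches(path; batch_size = 10_000)
    s = open(path, "r")
    a = Mmap.mmap(s)            # Vector{UInt8}: one element per byte, not per row
    batch = String[]
    start = 1
    for i in eachindex(a)
        if a[i] == 0x0a         # a newline byte marks the end of one row
            push!(batch, String(a[start:i-1]))
            start = i + 1
            if length(batch) == batch_size
                # ... build a matrix/DataFrame from `batch`, write a .csv ...
                empty!(batch)   # free the processed rows before the next block
            end
        end
    end
    # (a final partial batch, or a last line without a trailing newline,
    #  would still need handling here)
    close(s)
end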
Still curious about the above, but here is an answer using CSV.jl:
using CSV, DataFrames

row_size = 10000
for rows in Iterators.partition(CSV.Rows("C:/Users/a.bannerman/Desktop/code/TS_data/big.txt"), row_size)
    df = DataFrame(rows)  # resolve this chunk into a DataFrame
    # perform operations on this specific chunk of the file
end
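To match the original goal of saving each block as a .csv before freeing it, the loop body could write each chunk out, along these lines (a sketch; the output file naming is made up):

using CSV, DataFrames

row_size = 10000
for (i, rows) in enumerate(Iterators.partition(CSV.Rows("C:/Users/a.bannerman/Desktop/code/TS_data/big.txt"), row_size))
    df = DataFrame(rows)
    CSV.write("C:/Users/a.bannerman/Desktop/code/TS_data/chunk_$(i).csv", df)  # hypothetical output path
end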
CSV.jl does the mmap internally.
From your post it seems that you want to process a huge CSV file row by row. This can be done using CSV.Rows instead of CSV.File.
julia> @time CSV.Rows("huge_huge_file.csv")
0.000654 seconds (1.39 k allocations: 37.500 KiB)
CSV.Rows("huge_huge_file.csv"):
Size: 10
Tables.Schema:
:elemtype Union{Missing, PosLenString}
:elemid Union{Missing, PosLenString}
...
Now, having a set of rows, you can iterate over it like anything else. Consider this code:
using CSV, DataFrames

df = DataFrame()
for row in CSV.Rows("huge_huge_file.csv")
    push!(df, row)          # append one parsed row to the DataFrame
    nrow(df) > 5 && break   # stop after a few rows, just for illustration
end
The CSV.jl docs read:
CSV.Rows: an alternative approach for consuming delimited data, where the input is only consumed one row at a time, which allows "streaming" the data with a lower memory footprint than CSV.File. Supports many of the same options as CSV.File, except column type handling is a little different. By default, every column type will be essentially Union{Missing, String}, i.e. no automatic type detection is done, but column types can be provided manually. Multithreading is not used while parsing. After constructing a CSV.Rows object, rows can be "streamed" by iterating, where each iteration produces a CSV.Row2 object, which operates similar to CSV.File's CSV.Row type where individual row values can be accessed via row.col1, row[:col1], or row[1]. If each row is processed individually, additional memory can be saved by passing reusebuffer=true, which means a single buffer will be allocated to hold the values of only the currently iterated row. CSV.Rows also supports the Tables.jl interface and can also be passed to valid sink functions.
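As the docs note, reusebuffer=true saves additional memory when each row is processed individually. A small sketch of that option (the column name elemtype is taken from the schema shown above):

using CSV

for row in CSV.Rows("huge_huge_file.csv"; reusebuffer = true)
    # the single buffer is reused on each iteration, so copy out any value you keep
    elem = row.elemtype === missing ? missing : String(row.elemtype)
    # ... process `elem` here ...
end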