I have a large dataset, in the hundreds-of-gigabytes range. I am using Polars' LazyFrame scan_csv to read the file, since it is memory efficient. I need to be able to return an arbitrary row quickly at any given time. My first attempt was to use slice. I was hoping this would be fast, and it is for rows near the beginning of the file, but for rows near the end it is very slow. Is there a faster way to do this?
Code to reproduce
import polars as pl
df = pl.scan_csv(A_very_large_text_file, has_header=False)
df.slice(index, 1).collect().item()
This can quickly retrieve items near the beginning of the file, but slows way down for items near the end.
There's no quick way to do this with a scan_csv-originated LazyFrame, because at some point it has to scan the whole file to reach a random row near the end.
This is a shortcoming of the CSV format itself: the only way a reader can get to an arbitrary line is to scan through the file line by line, looking for the \n character that marks the end of each line.
If you didn't care about knowing which line you get, you could seek to a random byte offset in the file, skip forward to the end of the (probably partial) line you landed in, and take the next full line; polars isn't built to do that, but plain Python file handling is. The catch is that lines which follow longer lines have a proportionally greater chance of being selected, so depending on the variance in line length and how much the randomness matters, this bias might make the approach unusable.
Notwithstanding the disclaimer, you could do:
import os
import random

import polars as pl

with open(A_very_large_text_file, "rb") as ff:  # binary mode so we can seek to any byte offset
    ff.seek(random.randrange(os.path.getsize(A_very_large_text_file)))
    ff.readline()  # discard the (probably partial) line we landed in
    line = ff.readline().decode().rstrip("\n")

randomish_row = pl.DataFrame({f"col{i}": [x] for i, x in enumerate(line.split(","))})
Alternatively, use pyarrow to convert your csv file into a parquet file with multiple row groups, then create your LazyFrame with scan_parquet. Because parquet files are highly structured (the file footer records where each row group starts and how many rows it contains), the reader can jump to an arbitrary part of the file much more efficiently. See here.
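A minimal sketch of that conversion, assuming pyarrow is installed and using a made-up output file name (A_very_large_text_file and index are the same placeholders as in the question); it streams the CSV in batches so the source never has to fit in memory:

import polars as pl
import pyarrow as pa
import pyarrow.csv as pa_csv
import pyarrow.parquet as pq

# Stream the CSV in batches; each write creates at least one row group,
# so a later reader can skip straight to the group that holds a given row.
reader = pa_csv.open_csv(
    A_very_large_text_file,
    read_options=pa_csv.ReadOptions(autogenerate_column_names=True),
)
with pq.ParquetWriter("A_very_large_file.parquet", reader.schema) as writer:  # hypothetical output name
    for batch in reader:
        writer.write_table(pa.Table.from_batches([batch]))

df = pl.scan_parquet("A_very_large_file.parquet")
df.slice(index, 1).collect().item()  # roughly as fast near the end as near the start

Smaller row groups mean less data has to be read to reach any single row, at the cost of more metadata and a slightly larger file.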