Tags: python, pandas, csv, io, pyarrow

How to read a huge CSV file faster?


I tried using pyarrow without success. My code:

import pandas as pd

df = pd.read_csv("file.csv", engine='pyarrow')

I get this error:

"pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)"

I cannot find any argument to change the block size. Any suggestions?


Solution

  • To set a block_size, you will need to use PyArrow's CSV reader directly:

    from pyarrow import csv
    
    # read the CSV with PyArrow, using a larger block size to avoid the
    # "straddling object straddles two block boundaries" error
    read_options = csv.ReadOptions(
        block_size=10 * 1024 * 1024,  # block size in bytes (10 MB here; the default is 1 MB)
    )
    
    table = csv.read_csv("file.csv", read_options=read_options)
    
    # convert the PyArrow Table to a pandas DataFrame
    df = table.to_pandas()
    

    You can also use more efficient file formats to store and load large datasets, such as the two below (see the sketch after this list):

    • Parquet: Parquet stores data column-wise rather than row-wise like CSV, which is useful when you only need a subset of the columns. It also offers compression and efficient encoding. Pandas has methods for writing and reading Parquet files as DataFrames.

    • Pickle: Pickle implements binary protocols for serializing and de-serializing a Python object structure. Pickle files are binary representations of Python objects and can be much smaller than text-based formats like CSV. Pandas has methods for writing and reading Pickle files as DataFrames.
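
    As a rough sketch (assuming the CSV has already been loaded into a DataFrame df; the file and column names below are hypothetical), converting to and reading back from these formats with pandas looks like this:

    import pandas as pd
    
    df = pd.read_csv("file.csv")  # load the CSV once
    
    # Parquet: columnar and compressed; later reads can select only the columns you need
    df.to_parquet("file.parquet")
    subset = pd.read_parquet("file.parquet", columns=["col_a", "col_b"])  # hypothetical column names
    
    # Pickle: binary serialization of the whole DataFrame
    df.to_pickle("file.pkl")
    df_again = pd.read_pickle("file.pkl")

    Reading the Parquet file back with only the columns you need avoids re-parsing the full CSV on every run.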