I tried using pyarrow without success. My code:
import pandas as pd

df = pd.read_csv("file.csv", engine='pyarrow')
I get this error:
"pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)"
I cannot find any argument to change the block size. Any suggestions?
To set a block_size, you will need to use PyArrow's CSV reader directly, since pandas' read_csv does not expose this option:
from pyarrow import csv
# read CSV using PyArrow with ReadOptions
read_options = csv.ReadOptions(
    block_size=10 * 1024 * 1024,  # <= increase the block size here (10 MB in this example)
)
table = csv.read_csv("file.csv", read_options=read_options)
# convert PyArrow Table to pandas DataFrame
df = table.to_pandas()
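For reference, PyArrow's default block_size is about 1 MB. Since the error message asks you to increase it, a value well above the default (as in the sketch above) usually resolves the straddling error; pick a size that comfortably exceeds your longest row.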
You can also use more efficient file formats to store and load large datasets, such as:
Parquet: Parquet stores data column-wise rather than row-wise like CSV, which is useful when you only need a subset of the columns. It also offers compression and efficient encoding. Pandas can write and read Parquet files directly with to_parquet and read_parquet.
Pickle: Pickle implements binary protocols for serializing and de-serializing Python object structures. Pickle files are serialized representations of Python objects and can be much smaller than text-based formats like CSV. Pandas can write and read them with to_pickle and read_pickle; a short round-trip sketch of both formats follows.
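As a minimal sketch of both round-trips, assuming df is the DataFrame loaded above (the file names and the column name "col_a" are just placeholders):

import pandas as pd

# Parquet: columnar and compressed; you can read back only the columns you need
df.to_parquet("file.parquet")  # needs pyarrow (or fastparquet) installed
subset = pd.read_parquet("file.parquet", columns=["col_a"])

# Pickle: binary serialization of the DataFrame object itself
df.to_pickle("file.pkl")
df_again = pd.read_pickle("file.pkl")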