I have a parquet file and I want to read the first n rows from the file into a pandas DataFrame.
What I tried:
df = pd.read_parquet(path= 'filepath', nrows = 10)
It did not work and gave me this error:
TypeError: read_table() got an unexpected keyword argument 'nrows'
I also tried the skiprows argument, but that gave me the same error.
Alternatively, I can read the complete parquet file and keep only the first n rows, but that requires more computation, which I want to avoid.
Is there any way to achieve it?
The accepted answer is out of date. It is now possible to read only the first few rows of a parquet file into pandas, though it is a bit messy and backend-dependent.
To read using PyArrow as the backend, do the following:
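from pyarrow.parquet import ParquetFile
import pyarrow as pa
pf = ParquetFile('file_name.pq')
# iter_batches yields record batches one at a time; take only the first
first_ten_rows = next(pf.iter_batches(batch_size = 10))
# wrap the batch in a Table and convert it to a pandas DataFrame
df = pa.Table.from_batches([first_ten_rows]).to_pandas()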
Change batch_size = 10 to match however many rows you want to read in.