python, pandas, parquet

Pandas: Reading the first n rows from a parquet file?


I have a parquet file and I want to read the first n rows from it into a pandas data frame. What I tried:

df = pd.read_parquet(path= 'filepath', nrows = 10)

It did not work and gave me this error:

TypeError: read_table() got an unexpected keyword argument 'nrows'

I also tried the skiprows argument, but that gave me the same kind of error.

Alternatively, I can read the complete parquet file and then keep only the first n rows, but that requires more computation, which I want to avoid.

Is there any way to achieve it?


Solution

  • The accepted answer is out of date. It is now possible to read only the first few rows of a parquet file into pandas, though it is a bit messy and backend-dependent.

    To read using PyArrow as the backend, do the following:

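    from pyarrow.parquet import ParquetFile
    import pyarrow as pa

    pf = ParquetFile('file_name.pq')
    # iter_batches yields record batches of at most batch_size rows;
    # next() pulls only the first batch instead of reading the whole file.
    first_ten_rows = next(pf.iter_batches(batch_size = 10))
    # Wrap the single batch in a Table so it can be converted to pandas.
    df = pa.Table.from_batches([first_ten_rows]).to_pandas()
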

    Change batch_size = 10 to however many rows you want to read in. Note that batch_size is a maximum: the batch will contain fewer rows if the file itself has fewer.