python-3.x pandas datetime parquet

Pandas read_parquet() Error: pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp


I am trying to read the February 2019 FHV trip data in Parquet format, found here:

https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2019-02.parquet

However, when I try to read the data with pandas:

import pandas as pd
df = pd.read_parquet('fhv_tripdata_2019-02.parquet')

It throws the error:

  File "pyarrow/table.pxi", line 1156, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 33106123800000000

Does anyone know how to print out the offending rows, coerce these values, or make pandas ignore these rows?


Solution

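  • One of the rows in that data set has its dropOff_datetime set to 3019-02-03 17:30:00.000000. This is out of bounds for pandas.Timestamp, which stores nanoseconds in a 64-bit integer and so cannot go past 2262-04-11 23:47:16.854775807. I think it was meant to be 2019-02-03 17:30:00.000000.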
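
To check this without loading the file, the number from the error message can be decoded with the standard library alone. This is only a sketch of the arithmetic; the variable names are mine and nothing here calls pandas or pyarrow.

```python
from datetime import datetime, timedelta

# Value reported in the ArrowInvalid message: microseconds since the Unix epoch.
raw_us = 33_106_123_800_000_000

# Plain Python datetimes go up to year 9999, so the value can be decoded directly.
epoch = datetime(1970, 1, 1)
as_datetime = epoch + timedelta(microseconds=raw_us)
print(as_datetime)  # 3019-02-03 17:30:00

# A nanosecond timestamp has to fit in a signed 64-bit integer.
INT64_MAX = 2**63 - 1
as_ns = raw_us * 1_000
print(as_ns > INT64_MAX)  # True, so the cast to timestamp[ns] is out of bounds

# Largest datetime (truncated to microseconds) that still fits in int64 nanoseconds.
max_ns_datetime = epoch + timedelta(microseconds=INT64_MAX // 1_000)
print(max_ns_datetime)  # 2262-04-11 23:47:16.854775
```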

    One option is to ignore that error:

    import pyarrow.parquet as pq
    
    df = pq.read_table('fhv_tripdata_2019-02.parquet').to_pandas(safe=False)
    

    But then the bad timestamp silently overflows and ends up as an unrelated value:

    >>> df['dropOff_datetime'].min()
    Timestamp('1849-12-25 18:20:52.580896768')
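
That 1849 value looks consistent with the nanosecond count wrapping around a signed 64-bit integer. The sketch below reproduces the arithmetic in plain Python, assuming two's-complement wraparound; `wrap_int64` is a helper I made up for illustration, not a pyarrow or pandas function.

```python
from datetime import datetime, timedelta

# Offending value from the error message: microseconds since the Unix epoch.
raw_us = 33_106_123_800_000_000


def wrap_int64(value: int) -> int:
    """Reduce an integer to the signed 64-bit two's-complement range."""
    return (value + 2**63) % 2**64 - 2**63


# Converting to nanoseconds overflows int64 and wraps to a negative number.
wrapped_ns = wrap_int64(raw_us * 1_000)
print(wrapped_ns)  # -3787364347419103232

# datetime only has microsecond resolution, so floor the nanoseconds.
wrapped = datetime(1970, 1, 1) + timedelta(microseconds=wrapped_ns // 1_000)
print(wrapped)  # 1849-12-25 18:20:52.580896
```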
    

    Alternatively, you can filter out the out-of-bounds values in pyarrow before converting:

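    import pandas as pd
    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq
    
    table = pq.read_table("fhv_tripdata_2019-02.parquet")
    # Keep only rows whose drop-off time fits in pandas' nanosecond range.
    # Rows with a null dropOff_datetime are dropped as well, because filter()
    # skips null mask entries by default.
    df = table.filter(
        pc.less_equal(table["dropOff_datetime"], pa.scalar(pd.Timestamp.max))
    ).to_pandas()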