I am trying to read the 02-2019 fhv data in parquet format found here
https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2019-02.parquet
However, when I try to read the data with pandas:
df = pd.read_parquet('fhv_tripdata_2019-02.parquet')
It throws the error:
File "pyarrow/table.pxi", line 1156, in pyarrow.lib.table_to_blocks
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 33106123800000000
Does anyone know how to print out the offending rows or coerce these values? Or how to make it ignore these rows?
One of the rows in that data set has its dropOff_datetime set to 3019-02-03 17:30:00.000000. This is out of bounds for pandas.Timestamp. I think it was meant to be 2019-02-03 17:30:00.000000.
One option is to ignore that error:
import pyarrow.parquet as pq
df = pq.read_table('fhv_tripdata_2019-02.parquet').to_pandas(safe=False)
But then the wrong timestamp silently overflows and ends up as a meaningless value:
>>> df['dropOff_datetime'].min()
Timestamp('1849-12-25 18:20:52.580896768')
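To see where that odd date comes from, here is a small sketch in plain Python. It assumes the overflow behaves like an ordinary signed 64-bit wraparound when the microsecond value from the error message is multiplied by 1000 to get nanoseconds; the variable names are just for illustration.

```python
from datetime import datetime, timedelta

# Value from the error message: microseconds since the Unix epoch.
micros = 33106123800000000

# Converting to nanoseconds multiplies by 1000, which no longer fits in int64.
nanos = micros * 1000

# Emulate the two's-complement wraparound of a signed 64-bit integer
# (assumption: this is what the unsafe cast effectively does).
wrapped = (nanos + 2**63) % 2**64 - 2**63

# Interpret the wrapped value as nanoseconds since the epoch.
# datetime only keeps microseconds, so the last three digits are dropped.
overflowed = datetime(1970, 1, 1) + timedelta(microseconds=wrapped // 1000)

print(wrapped)
print(overflowed)
```

The wrapped value lands on 1849-12-25 18:20:52.580896, which agrees with the minimum shown above down to microsecond precision.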
Alternatively, you can filter out the values that are out of bounds in pyarrow, before converting to pandas:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
table = pq.read_table("fhv_tripdata_2019-02.parquet")
df = table.filter(
    pc.less_equal(table["dropOff_datetime"], pa.scalar(pd.Timestamp.max))
).to_pandas()