python pandas parquet pyarrow fastparquet

pyarrow timestamp datatype error on parquet file

I get this error when I read a Parquet file and count its records in pandas using pyarrow. I do not want pyarrow to convert to timestamp[ns]; it can stay as timestamp[us]. Is there an option to keep the timestamps as they are? I am using pyarrow 11.0.0 and Python 3.10. Please advise.

code:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import pandas as pd

# Read the Parquet file with PyArrow and convert it to a pandas DataFrame
table = pq.read_table('/Users/abc/Downloads/LOAD.parquet').to_pandas()

print(len(table))

error:

pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 101999952000000000

Solution

I do not want pyarrow to convert to timestamp[ns], it can keep in timestamp[us], is there an option to keep timestamp as is?

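At the moment, pandas only supports nanosecond-resolution timestamps (datetime64[ns]). That type cannot represent dates later than 2262-04-11, so the out-of-range value in your file cannot be converted.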

If you insist on keeping microsecond (us) precision, you have a few options:

Don't use pandas; stick to pyarrow, which supports microsecond timestamps:

table = pq.read_table("data.parquet")
len(table)

Use datetime.datetime instead of pd.Timestamp in your dataframe (very slow):

table = pq.read_table("data.parquet")
df = table.to_pandas(timestamp_as_object=True)

Ignore the overflow for the timestamps that are out of range:

table = pq.read_table("data.parquet")
df = table.to_pandas(safe=False)

But the original timestamp, 5202-04-02, silently overflows and becomes 1694-12-04.
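The 1694-12-04 result is consistent with a 64-bit integer wraparound when the microsecond value is multiplied by 1000 to get nanoseconds. The sketch below reproduces the arithmetic with the standard library only; the helper wrap_int64 is a hypothetical name for illustration, and the assumption is that the unsafe cast behaves like two's-complement overflow.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def wrap_int64(value):
    """Emulate two's-complement wraparound of a signed 64-bit integer."""
    return (value - INT64_MIN) % 2**64 + INT64_MIN


# The microsecond value from the error message.
value_us = 101_999_952_000_000_000
original = EPOCH + timedelta(microseconds=value_us)

# Converting to nanoseconds multiplies by 1000, which no longer fits in int64.
value_ns = value_us * 1000
fits = INT64_MIN <= value_ns <= INT64_MAX

# What the value becomes if the overflow is ignored.
wrapped_ns = wrap_int64(value_ns)
wrapped = EPOCH + timedelta(microseconds=wrapped_ns // 1000)

print(original.date(), fits, wrapped.date())
```

Under these assumptions the original value decodes to 5202-04-02 and the wrapped value to 1694-12-04, matching the dates quoted above.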

If you're feeling intrepid, use pandas 2.0 with pyarrow as the backend for pandas:

pip install pandas==2.0.0rc1

pd.read_parquet("data.parquet", dtype_backend="pyarrow")

Fix the data using pyarrow:

Surely 5202-04-02 is a typo. See this question