As seen below, the "dob" field was of type timestamp[s] when the schema was written to Parquet format with pq.write_metadata. But upon rereading the metadata, the type changed to timestamp[ms]:
Python 3.11.1 (main, Jan 26 2023, 10:38:20) [GCC 8.5.0 20210514 (Red Hat 8.5.0-18)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa, pyarrow.parquet as pq
>>> schema = pa.schema([ pa.field("dob", pa.timestamp('s')) ])
>>> schema
dob: timestamp[s]
>>> pq.write_metadata(schema, '_common_schema')
>>> reloaded_schema = pq.read_schema('_common_schema')
>>> reloaded_schema
dob: timestamp[ms]
>>>
Is this because the Parquet format does not support timestamps with a unit of seconds?
How can I make the reloaded schema exactly the same as the original in this case?
The behavior you're observing happens because Parquet has no second-resolution timestamp type. When you write a PyArrow schema with a timestamp unit of s to a Parquet file, the unit is coerced to ms on storage. When you reload the file, the stored ms unit is what you get back, so the field is reloaded as timestamp[ms]. To avoid the mismatch, either specify the timestamp unit as ms in PyArrow when writing to Parquet, or cast the field back to its original unit after reading.
The Parquet format does not support timestamps with a unit of seconds (s); the coarsest unit its TIMESTAMP logical type supports is milliseconds (ms). This means that when a PyArrow schema with a second-resolution timestamp field is written to a Parquet file, the field is automatically converted to millisecond resolution on storage. When the file is reloaded, that stored millisecond unit is used, so the reloaded schema shows the field as timestamp[ms].
If millisecond resolution is acceptable for your data, you can use:
import pyarrow as pa
import pyarrow.parquet as pq
# Specify the Timestamp unit as milliseconds
schema = pa.schema([ pa.field("dob", pa.timestamp('ms')) ])
# Write the schema to a Parquet metadata file
pq.write_metadata(schema, '_common_schema')
# Read the schema back from the metadata file
reloaded_schema = pq.read_schema('_common_schema')
# The reloaded schema should now show the Timestamp field as having a unit of milliseconds
print(reloaded_schema)
This should result in the expected behavior, where the Timestamp field is correctly represented as having a unit of milliseconds, both when written to the Parquet file and when reloaded from the file.
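If you need the reloaded schema to match the original second resolution exactly, another option is to cast the affected field back after reading. This is a minimal sketch; it assumes you know which fields were originally declared with unit s:
import pyarrow as pa
import pyarrow.parquet as pq
# Write a schema that uses second resolution; Parquet stores it as milliseconds
schema = pa.schema([ pa.field("dob", pa.timestamp('s')) ])
pq.write_metadata(schema, '_common_schema')
# Read it back (dob comes back as timestamp[ms]) and restore the original unit
reloaded_schema = pq.read_schema('_common_schema')
restored_schema = reloaded_schema.set(
    reloaded_schema.get_field_index("dob"),
    pa.field("dob", pa.timestamp('s')),
)
print(restored_schema)  # dob: timestamp[s]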
In addition to timestamps, there are a few other data types that can be represented differently in Arrow and Parquet. Here are some to be aware of:
Decimal: Arrow and Parquet both carry a precision and scale for decimal fields, but the physical representation differs: depending on the precision, Parquet stores decimals as 32-bit or 64-bit integers or as fixed-length byte arrays. Verify that the precision and scale you declared survive the round trip as expected.
Timestamp: As we have seen, Parquet has no second-resolution timestamps, so a timestamp[s] field is coerced to timestamp[ms] on write. When writing full tables, pq.write_table also takes a coerce_timestamps argument that lets you control the stored unit explicitly (see the sketch below).
Time: Time-of-day fields have the same limitation: Parquet's TIME logical type supports millisecond, microsecond, and nanosecond resolution but not seconds, so make sure the unit you expect is the one that comes back after a round trip.
Nested structures: Arrow's nested types (lists, structs, maps) map onto Parquet's nested group types, but the mapping is not always one-to-one; types with no Parquet counterpart (such as unions) cannot be written directly, and some variants may come back as their plain equivalents. Check that nested fields have the expected types after a round trip.
These are some of the main differences to be aware of when converting between Arrow and Parquet. It's important to ensure that the data is correctly represented in both formats to avoid unexpected behavior and data loss.
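For the timestamp units specifically, here is a small sketch of how coerce_timestamps can be used when writing a table (the file name and sample data are made up for illustration):
import pyarrow as pa
import pyarrow.parquet as pq
# Hypothetical table with a nanosecond-resolution timestamp column
table = pa.table({"ts": pa.array([0, 1_000_000_000], type=pa.timestamp('ns'))})
# Ask the writer to store the column as microseconds; without
# allow_truncated_timestamps, losing sub-microsecond precision raises an error
pq.write_table(table, 'events.parquet',
               coerce_timestamps='us',
               allow_truncated_timestamps=True)
# The Parquet file schema shows the column stored with microsecond resolution
print(pq.ParquetFile('events.parquet').schema)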