Given a ParquetFile object (docs), I am able to retrieve data and metadata at the row group / column chunk level, either with read_row_group or with the metadata attribute:
from pyarrow import fs
from pyarrow.parquet import ParquetFile
s3 = fs.S3FileSystem(region='us-east-2')
path = 'voltrondata-labs-datasets/nyc-taxi/year=2009/month=1/part-0.parquet'
source = s3.open_input_file(path)
parquet_file = ParquetFile(source)
# row_group metadata
parquet_file.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x7f8f5edcda40>
  num_columns: 22
  num_rows: 11624
  total_byte_size: 712185
# column_chunk metadata
parquet_file.metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f8f5edcda90>
  file_offset: 1636
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 11624
  path_in_schema: vendor_name
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f8f5eb74c20>
      has_min_max: True
      min: CMT
      max: VTS
      null_count: 0
      distinct_count: 0
      num_values: 11624
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 41
  total_compressed_size: 1632
  total_uncompressed_size: 1625
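And, for completeness, the read_row_group path returns the row group's actual data as a pyarrow Table:
# row group data (as opposed to metadata)
table = parquet_file.read_row_group(0)
table.num_rows
11624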
But I cannot go further than that. Is it possible to get page-level information (page header, repetition levels, definition levels and values), as outlined in the parquet docs?
Note: I am interested in this to learn how parquet files work under the hood. I've had a look at introspection tools (like parquet-tools), but it seems to be deprecated and the alternatives only give row-group-level information.
You cannot access that information in pyarrow today. PyArrow has so far focused on converting parquet files to the Arrow representation, and there is no equivalent to pages in Arrow. The info should be available in parquet-cpp (which, confusingly, is a project that also lives in the Arrow GitHub repo) if you're able to dig into C++. It may be possible to get that info in other parquet projects, but I am not as familiar with them.
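That said, if you only want to poke at the raw bytes, the column chunk metadata you already printed tells you where the pages start. Here is a minimal sketch (not a supported pyarrow API; it just slices the page bytes out, and actually decoding them would require the Thrift-compact PageHeader definitions from the parquet-format spec):
from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem(region='us-east-2')
path = 'voltrondata-labs-datasets/nyc-taxi/year=2009/month=1/part-0.parquet'
source = s3.open_input_file(path)

col = ParquetFile(source).metadata.row_group(0).column(0)

# the chunk's pages begin at the dictionary page if there is one,
# otherwise at the first data page
start = (col.dictionary_page_offset
         if col.has_dictionary_page
         else col.data_page_offset)

source.seek(start)
raw = source.read(col.total_compressed_size)  # every page of this column chunk
# `raw` starts with a Thrift-compact-serialized PageHeader, followed by the
# (Snappy-compressed) repetition levels, definition levels and values;
# pyarrow offers no parser for it, so from here you would need the
# parquet-format Thrift definitions (or parquet-cpp) to go further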