python parquet pyarrow

How do I get page level data of a parquet file with pyarrow?


Given a ParquetFile object (docs), I can retrieve data at the row group / column chunk level, either with read_row_group or through the metadata attribute:

from pyarrow import fs
from pyarrow.parquet import ParquetFile

s3 = fs.S3FileSystem(region='us-east-2')
path = 'voltrondata-labs-datasets/nyc-taxi/year=2009/month=1/part-0.parquet'
source = s3.open_input_file(path)
parquet_file = ParquetFile(source)

# row_group metadata
parquet_file.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x7f8f5edcda40>
 num_columns: 22
 num_rows: 11624
 total_byte_size: 712185

# column_chunk metadata
parquet_file.metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f8f5edcda90>
  file_offset: 1636
  file_path: 
  physical_type: BYTE_ARRAY
  num_values: 11624
  path_in_schema: vendor_name
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f8f5eb74c20>
      has_min_max: True
      min: CMT
      max: VTS
      null_count: 0
      distinct_count: 0
      num_values: 11624
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 41
  total_compressed_size: 1632
  total_uncompressed_size: 1625

But I cannot go further than that. Is it possible to get page-level information (page header, repetition levels, definition levels and values) as outlined in the Parquet docs?

[Figure: diagram showing the hierarchical components of a Parquet file]

Note: I am interested in this to learn how Parquet files work under the hood. I've looked at introspection tools such as parquet-tools, but it seems to be deprecated, and the alternatives only give row-group-level information.


Solution

  • You cannot access that information in PyArrow today. PyArrow was initially focused on converting Parquet files to the Arrow representation, and Arrow has no equivalent of pages. The information should be available in parquet-cpp (which, confusingly, is a project that also lives in the Arrow GitHub repo) if you're able to dig into C++. Other Parquet projects may expose it too, but I am not as familiar with them.
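
For learning purposes, page headers can also be decoded by hand, because each page in a column chunk starts with a Thrift compact-protocol PageHeader struct. The following is a minimal sketch, not a PyArrow API: the helper names (read_varint, unzigzag, read_struct, iter_page_headers) are hypothetical, the field numbers are taken from parquet.thrift (1 = type, 2 = uncompressed_page_size, 3 = compressed_page_size, 5 = data_page_header, 7 = dictionary_page_header), and it assumes the chunk's bytes contain nothing but back-to-back pages. It is exercised on a hand-built header rather than on the file from the question.

```python
# Minimal Thrift compact-protocol reader, just enough for Parquet PageHeader structs.
# Lists, maps and doubles are deliberately unsupported (PageHeader does not need them).

PAGE_TYPES = {0: "DATA_PAGE", 1: "INDEX_PAGE", 2: "DICTIONARY_PAGE", 3: "DATA_PAGE_V2"}


def read_varint(buf, pos):
    """Decode an unsigned LEB128 varint; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def unzigzag(n):
    """Undo zigzag encoding (0, 1, 2, 3 -> 0, -1, 1, -2)."""
    return (n >> 1) ^ -(n & 1)


def read_struct(buf, pos):
    """Decode one struct into {field_id: value}; return (fields, next_pos)."""
    fields, field_id = {}, 0
    while True:
        header = buf[pos]
        pos += 1
        if header == 0:  # STOP marker ends the struct
            return fields, pos
        delta, ctype = header >> 4, header & 0x0F
        if delta:  # short form: field id is relative to the previous one
            field_id += delta
        else:  # long form: explicit zigzag-encoded field id
            raw, pos = read_varint(buf, pos)
            field_id = unzigzag(raw)
        if ctype in (1, 2):  # booleans are stored in the type nibble itself
            value = ctype == 1
        elif ctype == 3:  # i8
            value = int.from_bytes(buf[pos:pos + 1], "little", signed=True)
            pos += 1
        elif ctype in (4, 5, 6):  # i16 / i32 / i64 are zigzag varints
            raw, pos = read_varint(buf, pos)
            value = unzigzag(raw)
        elif ctype == 8:  # binary / string: varint length, then the bytes
            length, pos = read_varint(buf, pos)
            value = bytes(buf[pos:pos + length])
            pos += length
        elif ctype == 12:  # nested struct
            value, pos = read_struct(buf, pos)
        else:
            raise ValueError(f"unsupported compact type {ctype}")
        fields[field_id] = value


def iter_page_headers(chunk):
    """Yield each PageHeader in a column chunk, skipping over the page bodies."""
    pos = 0
    while pos < len(chunk):
        header, pos = read_struct(chunk, pos)
        yield header
        pos += header[3]  # field 3 = compressed_page_size


# Hand-built dictionary page header (type=2, sizes 20/22, 2 dictionary values),
# followed by a 22-byte dummy page body.
SAMPLE = bytes.fromhex("15041528152c4c150415040000") + bytes(22)
for page in iter_page_headers(SAMPLE):
    print(PAGE_TYPES[page[1]], page)
# prints: DICTIONARY_PAGE {1: 2, 2: 20, 3: 22, 7: {1: 2, 2: 2}}
```

To try this on a real file, the chunk bytes would come from the metadata shown in the question: start at dictionary_page_offset when has_dictionary_page is true (otherwise data_page_offset) and read total_compressed_size bytes, for example with source.seek(start) followed by source.read(length). For the column above that is 4 + 1632 = 1636, which matches the reported file_offset. The page bodies that follow each header are still compressed (SNAPPY here), so getting at the repetition levels, definition levels and values requires decompressing and decoding them separately.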