
How to decode dictionary column when using pyarrow to read parquet files?


I have three .snappy.parquet files stored in an S3 bucket. I tried to use pandas.read_parquet(), but it only works when I specify a single parquet file, e.g. df = pandas.read_parquet("s3://bucketname/xxx.snappy.parquet"). If I don't specify the filename, df = pandas.read_parquet("s3://bucketname") doesn't work and gives me the error: Seek before start of file.

I did a lot of reading and then found this page.

It suggests that we can use pyarrow to read multiple parquet files, so here's what I tried:

import s3fs
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()
bucket_uri = 's3://bucketname'

data = pq.ParquetDataset(bucket_uri, filesystem=s3)
df = data.read().to_pandas()

This works, but I found that the value of one of the columns in this df is a dictionary. How can I decode this dictionary so that selected keys become column names and the values become the corresponding column values?

For example, the current column:

column_1
{'Id': 'xxxxx', 'name': 'xxxxx','age': 'xxxxx'....}

The expected column:

Id    age
xxx   xxx
xxx   xxx

Here's the output for data.read().schema:

column_0: string
  -- field metadata --
  PARQUET:field_id: '1'
column_1: struct<Id: string, name: string, age: string,.......>
  child 0, Id: string
    -- field metadata --
    PARQUET:field_id: '3'
  child 1, name: string
    -- field metadata --
    PARQUET:field_id: '7'
  child 2, age: string
    -- field metadata --
    PARQUET:field_id: '8'
  ...........
...........

Solution

  • You have a column with a "struct type" and you want to flatten it. To do so, call flatten before calling to_pandas:

    import pandas as pd
    import pyarrow as pa
    
    COLUMN1_SCHEMA = pa.struct([('Id', pa.string()), ('Name', pa.string()), ('Age', pa.string())])
    SCHEMA = pa.schema([("column1", COLUMN1_SCHEMA), ('column2', pa.int32())])
    
    df = pd.DataFrame({
        "column1": [("1", "foo", "16"), ("2", "bar", "17"), ],
        "column2": [1, 2],
    })
    pa.Table.from_pandas(df, SCHEMA).to_pandas()  # without flatten
    
    | column1                                 |   column2 |
    |:----------------------------------------|----------:|
    | {'Id': '1', 'Name': 'foo', 'Age': '16'} |         1 |
    | {'Id': '2', 'Name': 'bar', 'Age': '17'} |         2 |
    
    pa.Table.from_pandas(df, SCHEMA).flatten().to_pandas()  # with flatten
    |   column1.Id | column1.Name   |   column1.Age |   column2 |
    |-------------:|:---------------|--------------:|----------:|
    |            1 | foo            |            16 |         1 |
    |            2 | bar            |            17 |         2 |
    

    As a side note, you shouldn't call it a dictionary column. "Dictionary" is a loaded term in pyarrow and usually refers to dictionary encoding.
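    If the table has already been converted to pandas (as in the question) and the nested column arrives as a column of Python dicts, it can also be expanded on the pandas side with pd.json_normalize. A minimal sketch, using made-up sample values shaped like the question's column_1:

    ```python
    import pandas as pd

    # A DataFrame where column_1 holds dicts, as in the question
    df = pd.DataFrame({
        "column_0": ["a", "b"],
        "column_1": [
            {"Id": "1", "name": "foo", "age": "16"},
            {"Id": "2", "name": "bar", "age": "17"},
        ],
    })

    # Expand the dict column into separate columns, then keep only the ones needed
    expanded = pd.json_normalize(df["column_1"].tolist())[["Id", "age"]]
    ```

    Flattening with pyarrow before to_pandas is usually cheaper, though, since it avoids materializing Python dicts at all.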

    Edit: how to read a subset of columns from a parquet file

    import pyarrow as pa
    import pyarrow.parquet as pq
    
    # df and SCHEMA are from the snippet above
    table = pa.Table.from_pandas(df, SCHEMA)
    pq.write_table(table, 'data.pq')
    # Using read_table:
    pq.read_table('data.pq', columns=['column1.Id', 'column1.Age'])
    # Using ParquetDataset:
    pq.ParquetDataset('data.pq').read(columns=['column1.Id', 'column1.Age'])
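    Putting the pieces together for the question's setup: read the dataset, flatten the struct column, then select the flattened children. This sketch uses a local file as a stand-in for the S3 bucket (for the real data, pass filesystem=s3 and the bucket URI to ParquetDataset as in the question); the field names mirror the question's schema:

    ```python
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small table with a struct column, as a local stand-in for the S3 data
    schema = pa.schema([
        ("column_1", pa.struct([("Id", pa.string()),
                                ("name", pa.string()),
                                ("age", pa.string())])),
    ])
    table = pa.Table.from_pylist(
        [{"column_1": {"Id": "1", "name": "foo", "age": "16"}},
         {"column_1": {"Id": "2", "name": "bar", "age": "17"}}],
        schema=schema,
    )
    pq.write_table(table, "data.pq")

    # Read back, flatten the struct into "parent.child" columns, keep the ones needed
    df = pq.ParquetDataset("data.pq").read().flatten().to_pandas()
    df = df[["column_1.Id", "column_1.age"]]
    ```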