Search code examples
pythonpandasparquet

Convert parquet to list of objects in python


I am reading a parquet file with panda:

import pandas as pd
df = pd.read_parquet('myfile.parquet', engine='pyarrow')

The file has the following structure:

company_id user_id attribute_name attribute_value timestamp
1 116664 111f07000612 first_name Tom 2022-03-23 17:11:58
2 116664 111f07000612 last_name Cruise 2022-03-23 17:11:58
3 116664 111f07000612 city New York 2022-03-23 17:11:58
4 116664 abcf0700d009d122 first_name Matt 2022-02-23 10:11:59
5 116664 abcf0700d009d122 last_name Damon 2022-02-23 10:11:59

I would like to group by user_id and generate a list of objects (that will be stored as json) with the following format:

[
 {
   "user_id": "111f07000612",
   "first_name": "Tom",
   "last_name": "Cruise",
   "city": "New York"
 },
 {
   "user_id": "abcf0700d009d122",
   "first_name": "Matt",
   "last_name": "Damon"
 }
]

Solution

  • Hi 👋🏻 Hope you are doing well!

    You can achieve it with something similar to this 🙂

    
    from pprint import pprint
    
    import pandas as pd
    
    
    # because I don't have the exact parquet file, I will just mock it
    # df = pd.read_parquet("myfile.parquet", engine="pyarrow")
    df = pd.DataFrame(
        {
            "company_id": [116664, 116664, 116664, 116664, 116664],
            "user_id": ["111f07000612", "111f07000612", "111f07000612", "abcf0700d009d122", "abcf0700d009d122"],
            "attribute_name": ["first_name", "last_name", "city", "first_name", "last_name"],
            "attribute_value": ["Tom", "Cruise", "New York", "Matt", "Damon"],
            "timestamp": ["2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-03-23 17:11:58"]
        }
    )
    
    records = []
    
    for user_id, group in df.groupby("user_id"):
        transformed_group = (
            group[["attribute_name", "attribute_value"]]
            .set_index("attribute_name")
            .transpose()
            .assign(user_id=user_id)
        )
        rercord, *_ = transformed_group.to_dict("records")
        records.append(rercord)
    
    pprint(records)
    # [{'city': 'New York',
    #   'first_name': 'Tom',
    #   'last_name': 'Cruise',
    #   'user_id': '111f07000612'},
    #  {'first_name': 'Matt', 'last_name': 'Damon', 'user_id': 'abcf0700d009d122'}]