I am reading a parquet file with panda:
import pandas as pd
df = pd.read_parquet('myfile.parquet', engine='pyarrow')
The file has the following structure:
company_id | user_id | attribute_name | attribute_value | timestamp | |
---|---|---|---|---|---|
1 | 116664 | 111f07000612 | first_name | Tom | 2022-03-23 17:11:58 |
2 | 116664 | 111f07000612 | last_name | Cruise | 2022-03-23 17:11:58 |
3 | 116664 | 111f07000612 | city | New York | 2022-03-23 17:11:58 |
4 | 116664 | abcf0700d009d122 | first_name | Matt | 2022-02-23 10:11:59 |
5 | 116664 | abcf0700d009d122 | last_name | Damon | 2022-02-23 10:11:59 |
I would like to group by user_id and generate a list of objects (that will be stored as json) with the following format:
[
{
"user_id": "111f07000612",
"first_name": "Tom",
"last_name": "Cruise",
"city": "New York"
},
{
"user_id": "abcf0700d009d122",
"first_name": "Matt",
"last_name": "Damon"
}
]
Hi 👋🏻 Hope you are doing well!
You can achieve it with something similar to this 🙂
from pprint import pprint
import pandas as pd
# because I don't have the exact parquet file, I will just mock it
# df = pd.read_parquet("myfile.parquet", engine="pyarrow")
df = pd.DataFrame(
{
"company_id": [116664, 116664, 116664, 116664, 116664],
"user_id": ["111f07000612", "111f07000612", "111f07000612", "abcf0700d009d122", "abcf0700d009d122"],
"attribute_name": ["first_name", "last_name", "city", "first_name", "last_name"],
"attribute_value": ["Tom", "Cruise", "New York", "Matt", "Damon"],
"timestamp": ["2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-03-23 17:11:58", "2022-03-23 17:11:58"]
}
)
records = []
for user_id, group in df.groupby("user_id"):
transformed_group = (
group[["attribute_name", "attribute_value"]]
.set_index("attribute_name")
.transpose()
.assign(user_id=user_id)
)
rercord, *_ = transformed_group.to_dict("records")
records.append(rercord)
pprint(records)
# [{'city': 'New York',
# 'first_name': 'Tom',
# 'last_name': 'Cruise',
# 'user_id': '111f07000612'},
# {'first_name': 'Matt', 'last_name': 'Damon', 'user_id': 'abcf0700d009d122'}]