python pandas pandas-groupby parquet fastparquet

Getting a dataframe from a pandas groupby to write to parquet

I have some csv data with the following columns:

country, region, year, month, price, volume

I need to transform this to something like the following:

country, region, datapoints

where datapoints is either an array of (year, month, price, volume) tuples or (better) a mapping of the form:

{ (year, month) : {price, volume} }

Effectively, I'm trying to reshape the data to be a timeseries that can then be stored as parquet. For what it's worth, I'm using fastparquet to write dataframes to parquet files.

Is this possible?

Solution

You can use apply to create the column 'datapoint':

df['datapoint'] = df.apply(lambda row: (row['year'], row['month'],
                                        row['price'], row['volume']), axis=1)

df['datapoint_better'] = df.apply(lambda row: {(row['year'], row['month']):
                                               {row['price'], row['volume']}}, axis=1)

Note that you can't use a set such as {row['year'], row['month']} as a key in a dictionary, because sets are not hashable. That is why the key is a tuple.
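To illustrate the point about keys, here is a minimal sketch with made-up values (the numbers are purely illustrative, not from the question's data). It shows that a set key raises a TypeError, while a tuple key works:

```python
# Illustrative values only; they are not taken from the original data.
year, month, price, volume = 2018, 1, 10.5, 100

error_message = None
try:
    # A set is mutable and therefore unhashable, so this fails at runtime.
    bad = {{year, month}: (price, volume)}
except TypeError as exc:
    error_message = str(exc)

# A tuple is immutable and hashable, so it is accepted as a key.
good = {(year, month): (price, volume)}
```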

And then, if you want to get rid of the original columns:

df = df.drop(columns=['year', 'month', 'price', 'volume'])

EDIT: OK, I missed the groupby. You can first create two columns holding the keys and the items:

df['key'] = df.apply(lambda row: (row['year'], row['month']), axis=1)
df['item'] = df.apply(lambda row: {row['price'], row['volume']}, axis=1)

Then group by country and region, and within each group build a pd.Series from these two columns and call to_dict on it:

df_output = (df.groupby(['country', 'region'])
               # bracket access avoids clashes between column names and attributes
               .apply(lambda df_grouped: pd.Series(df_grouped['item'].values,
                                                   index=df_grouped['key']).to_dict())
               .reset_index().rename(columns={0: 'datapoints'}))

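reset_index turns the grouped result back into a dataframe with country and region as regular columns, and rename gives the resulting column 0 the name 'datapoints', which matches the expected output.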

Note: I would suggest using a tuple for the item as well, rather than a set. A set is unordered, so you could not tell price from volume afterwards, and it also collapses equal values into a single element.
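Putting the pieces together, here is a rough end-to-end sketch that uses tuples for both keys and items. The sample data and the helper name to_datapoints are made up for illustration, and it iterates over the groups directly instead of using groupby(...).apply as above, so treat it as one possible variant rather than the only way to do it:

```python
import pandas as pd

# Hypothetical sample data with the columns from the question.
df = pd.DataFrame({
    "country": ["FR", "FR", "FR", "DE"],
    "region": ["North", "North", "South", "East"],
    "year": [2018, 2018, 2018, 2018],
    "month": [1, 2, 1, 1],
    "price": [10.0, 11.0, 9.5, 12.0],
    "volume": [100, 110, 90, 120],
})


def to_datapoints(group):
    # Hypothetical helper: map (year, month) -> (price, volume).
    # Tuples on both sides keep the key hashable and the value ordered.
    return {
        (int(y), int(m)): (float(p), int(v))
        for y, m, p, v in zip(group["year"], group["month"],
                              group["price"], group["volume"])
    }


# One output row per (country, region) group.
records = [
    {"country": country, "region": region, "datapoints": to_datapoints(group)}
    for (country, region), group in df.groupby(["country", "region"])
]
df_output = pd.DataFrame(records)
```

Whether a column of dictionaries with tuple keys can be written to parquet as-is depends on the writer and its encoding options, so it is worth checking that step separately.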