Getting a dataframe from a pandas groupby to write to parquet

I have some csv data with the following columns:

country, region, year, month, price, volume

I need to transform this to something like the following:

country, region, datapoints

Where datapoints consists of either: (year, month, price, volume) tuple array, OR (better)

{ (year, month) : {price, volume} }

Effectively, I'm trying to reshape the data to be a timeseries that can then be stored as parquet. For what it's worth, I'm using fastparquet to write dataframes to parquet files.

Is this possible?


  • You can use apply to create the column 'datapoint':

    df['datapoint'] = df.apply(lambda row: (row['year'], row['month'],
                                            row['price'], row['volume']), axis=1)


    df['datapoint_better'] = df.apply(lambda row: {(row['year'], row['month']):
                                                   {row['price'], row['volume']}}, axis=1)

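    Note that you can't use {row['year'],row['month']} as a key in a dictionary: braces around plain values create a set, which is unhashable, so the key has to be the tuple (row['year'],row['month']).

    And then if you want to get rid of the original columns: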

    df = df.drop(['year','month','price','volume'], axis=1)

    EDIT: I missed the groupby. You can first create two columns, one holding the keys and one holding the items:

    df['key'] = df.apply(lambda row: ( row['year'], row['month']),1)
    df['item'] = df.apply(lambda row: { row['price'], row['volume']},1)

    and then do the groupby with apply, building a pd.Series from these two columns in each group and calling to_dict on it:

    df_output = (df.groupby(['country','region'])
                   .apply(lambda df_grouped: pd.Series(df_grouped['item'].values,
                                                       index=df_grouped['key']).to_dict())
                   .reset_index().rename(columns={0: 'datapoints'}))

    reset_index and rename are only there to produce the expected layout (country, region, datapoints).

    Note: I would suggest using a tuple for the item as well, rather than a set. A set is unordered, so you could not tell price from volume afterwards.
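
A minimal end-to-end sketch of the reshaping described above, under a few assumptions: the sample rows are made up for illustration, the helper name `to_datapoints` is hypothetical, and the items are stored as (price, volume) tuples as the note suggests. It iterates over the groups explicitly instead of calling groupby(...).apply, which is a swapped-in variant of the same idea rather than the code from the answer.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
df = pd.DataFrame({
    "country": ["FR", "FR", "FR", "DE"],
    "region": ["north", "north", "south", "east"],
    "year": [2020, 2020, 2020, 2021],
    "month": [1, 2, 1, 5],
    "price": [10.0, 11.5, 9.0, 7.25],
    "volume": [100, 120, 80, 60],
})


def to_datapoints(group: pd.DataFrame) -> dict:
    """Map (year, month) -> (price, volume) for one (country, region) group."""
    return {
        (int(y), int(m)): (float(p), int(v))
        for y, m, p, v in zip(
            group["year"], group["month"], group["price"], group["volume"]
        )
    }


# One output row per (country, region) pair; groupby sorts the group keys.
rows = [
    {"country": country, "region": region, "datapoints": to_datapoints(group)}
    for (country, region), group in df.groupby(["country", "region"])
]
df_output = pd.DataFrame(rows)

print(df_output)
```

Keys are tuples because a dictionary key must be hashable, and values are tuples so that the order (price, volume) is preserved. If two rows in a group share the same (year, month), the later one overwrites the earlier one in the dictionary.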