I have some csv data with the following columns:
country, region, year, month, price, volume
I need to transform this to something like the following:
country, region, datapoints
Where datapoints consists of either an array of (year, month, price, volume) tuples, OR (better) a mapping like:
{
(year, month) : {price, volume}
}
Effectively, I'm trying to reshape the data to be a timeseries that can then be stored as parquet. For what it's worth, I'm using fastparquet to write dataframes to parquet files.
Is this possible?
You can use apply to create the column 'datapoint':
df['datapoint'] = df.apply(lambda row: (row['year'], row['month'],
                                        row['price'], row['volume']), axis=1)
or
df['datapoint_better'] = df.apply(lambda row: {(row['year'], row['month']):
                                               {row['price'], row['volume']}}, axis=1)
Note that you can't use {row['year'], row['month']} as a key in a dictionary: a set is not hashable, which is why the key above is a tuple.
And then if you want to get rid of the original columns:
df = df.drop(columns=['year', 'month', 'price', 'volume'])
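To check the hashability point quickly, here is a minimal sketch in plain Python. The values are made-up sample data, not taken from the question.

```python
# Hypothetical sample values standing in for one CSV row.
year, month, price, volume = 2020, 1, 10.0, 100

# A tuple is hashable, so it can be used as a dictionary key.
ok = {(year, month): (price, volume)}

# A set is not hashable, so using it as a key raises TypeError.
try:
    {{year, month}: (price, volume)}
    set_key_failed = False
except TypeError:
    set_key_failed = True

print(ok)
print(set_key_failed)
```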
EDIT: OK, I missed the groupby. Anyway, you can first create two columns holding the keys and the items:
df['key'] = df.apply(lambda row: (row['year'], row['month']), axis=1)
df['item'] = df.apply(lambda row: {row['price'], row['volume']}, axis=1)
and then do the groupby with apply, calling pd.Series.to_dict on a Series built from these two columns, such as:
df_output = (df.groupby(['country', 'region'])
               .apply(lambda df_grouped: pd.Series(df_grouped['item'].values,
                                                   index=df_grouped['key']).to_dict())
               .reset_index().rename(columns={0: 'datapoints'}))
reset_index and rename are there to get the expected output.
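Note: I would suggest using a tuple for the item as well, rather than a set, to prevent any ordering issues, as a set is not ordered (it would also collapse price and volume into a single element whenever the two values happen to be equal).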
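Putting the pieces together, here is a self-contained sketch of the grouped version that uses tuples for both keys and items. It is a variation on the approach above, not the exact same code: the sample data and the helper name to_datapoints are hypothetical, and it zips the columns directly instead of creating the intermediate 'key' and 'item' columns.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    "country": ["US", "US", "FR"],
    "region": ["East", "East", "North"],
    "year": [2020, 2020, 2020],
    "month": [1, 2, 1],
    "price": [10.0, 11.5, 9.0],
    "volume": [100, 120, 80],
})


def to_datapoints(group: pd.DataFrame) -> dict:
    """Map (year, month) -> (price, volume) for one country/region group."""
    keys = zip(group["year"], group["month"])
    items = zip(group["price"], group["volume"])
    return dict(zip(keys, items))


value_cols = ["year", "month", "price", "volume"]
df_output = (
    df.groupby(["country", "region"])[value_cols]  # pass only the value columns to apply
      .apply(to_datapoints)                        # one dict per (country, region)
      .reset_index(name="datapoints")              # back to a flat DataFrame
)

print(df_output)
```

Selecting the value columns before apply keeps the grouping columns out of each group, and reset_index(name=...) names the result column directly, so no separate rename is needed.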