I would like to understand the different behavior in the following code.
This is a fresh conda installation of dask/distributed on Ubuntu 16.04.
us=dd.read_parquet("/home/.......",["date","num_25","num_100","num_unq"]).persist()
g=us.groupby(us.date.dt.week)
x=g["num_25","num_100","num_unq"].mean() # Works !
x=client.persist(x) #
x=g["num_25","num_100","num_unq"].var() # NOT WORKING
x=client.persist(x) #
x=g["num_25","num_100","num_unq"].std() # NOT WORKING
x=client.persist(x) #
x=g.num_100.var() # Works
x=client.persist(x)
I can aggregate groups of columns in the example above with mean/min/max.
However, for std/var I need to disaggregate and make the calculation one column at a time.
In the cases that do not work, the traceback reports a KeyError: ('num_25', 'num_100', 'num_unq').
In pandas/dask.dataframe you select multiple columns by passing a list of columns:
df[['a', 'b', 'c']].var() # good!
df['a', 'b', 'c'].var() # bad.
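A minimal sketch of the list-based selection, using pandas (dask.dataframe mirrors the pandas indexing API, so the same pattern applies to the groupby in the original snippet; the column names and data here are illustrative, not from the original parquet file):

```python
import pandas as pd

# Toy frame standing in for the real data; "week" plays the role of
# us.date.dt.week in the original code.
df = pd.DataFrame({
    "week": [1, 1, 2, 2],
    "num_25": [1.0, 2.0, 3.0, 4.0],
    "num_100": [5.0, 6.0, 7.0, 8.0],
})

g = df.groupby("week")

# Passing a LIST of column names yields a DataFrameGroupBy restricted to
# those columns, so reductions like var/std work across all of them:
v = g[["num_25", "num_100"]].var()
print(v)

# Passing bare comma-separated names (a tuple) asks for a single column
# whose label is the tuple itself, which is why a KeyError mentioning
# the whole tuple shows up when that label does not exist.
```

The same change applied to the original snippet would be `g[["num_25","num_100","num_unq"]].var()`.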