Tags: python, distributed, dask

dask groupby aggregation correct usage


I would like to understand the differing behavior in the following code.

This is a fresh conda installation of dask/distributed on Ubuntu 16.04.

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # client.persist is used below

us = dd.read_parquet("/home/.......", columns=["date", "num_25", "num_100", "num_unq"]).persist()
g = us.groupby(us.date.dt.week)

x = g["num_25", "num_100", "num_unq"].mean()  # works!
x = client.persist(x)

x = g["num_25", "num_100", "num_unq"].var()   # NOT WORKING
x = client.persist(x)

x = g["num_25", "num_100", "num_unq"].std()   # NOT WORKING
x = client.persist(x)

x = g.num_100.var()                           # works
x = client.persist(x)

In the example above I can aggregate multiple columns at once with mean/min/max.

However, for std/var I have to disaggregate and do the calculation one column at a time.

In the cases that do not work, the traceback reports a KeyError on ("num_25", "num_100", "num_unq").


Solution

  • In Pandas/Dask.dataframe you select multiple columns by passing a list of columns.

    df[['a', 'b', 'c']].var()  # good!
    df['a', 'b', 'c'].var()    # bad.
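
    Applied to the groupby in the question, that means wrapping the three
    column names in a list. Here is a minimal runnable sketch: the data is
    made up to stand in for the parquet file, and it groups by month because
    newer pandas spells .dt.week as .dt.isocalendar().week:

    import pandas as pd
    import dask.dataframe as dd

    # Made-up frame standing in for the parquet data from the question.
    pdf = pd.DataFrame({
        "date": pd.date_range("2017-01-01", periods=100, freq="D"),
        "num_25": range(100),
        "num_100": range(100),
        "num_unq": range(100),
    })
    us = dd.from_pandas(pdf, npartitions=4)

    g = us.groupby(us.date.dt.month)

    x = g[["num_25", "num_100", "num_unq"]].var()  # list selection: var/std work too
    print(x.compute())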