python pandas databricks spark-koalas

koalas groupby -> apply returns 'cannot insert "key", already exists'


I've been struggling with this issue and haven't been able to solve it. I have the following dataframe:

import databricks.koalas as ks
from pandas import Timestamp

x = ks.DataFrame.from_records(
{'ds': {0: Timestamp('2018-10-06 00:00:00'),
  1: Timestamp('2017-06-08 00:00:00'),
  2: Timestamp('2018-10-22 00:00:00'),
  3: Timestamp('2017-02-08 00:00:00'),
  4: Timestamp('2019-02-03 00:00:00'),
  5: Timestamp('2019-02-26 00:00:00'),
  6: Timestamp('2017-04-15 00:00:00'),
  7: Timestamp('2017-07-02 00:00:00'),
  8: Timestamp('2017-04-04 00:00:00'),
  9: Timestamp('2017-03-20 00:00:00'),
  10: Timestamp('2018-06-09 00:00:00'),
  11: Timestamp('2017-01-15 00:00:00'),
  12: Timestamp('2018-05-07 00:00:00'),
  13: Timestamp('2018-01-17 00:00:00'),
  14: Timestamp('2017-07-11 00:00:00'),
  15: Timestamp('2018-12-17 00:00:00'),
  16: Timestamp('2018-12-05 00:00:00'),
  17: Timestamp('2017-05-22 00:00:00'),
  18: Timestamp('2017-08-13 00:00:00'),
  19: Timestamp('2018-05-21 00:00:00')},
 'store': {0: 81,
  1: 128,
  2: 81,
  3: 128,
  4: 25,
  5: 128,
  6: 11,
  7: 124,
  8: 43,
  9: 25,
  10: 25,
  11: 124,
  12: 124,
  13: 128,
  14: 81,
  15: 11,
  16: 124,
  17: 11,
  18: 167,
  19: 128},
 'stock': {0: 1,
  1: 236,
  2: 3,
  3: 9,
  4: 36,
  5: 78,
  6: 146,
  7: 20,
  8: 12,
  9: 12,
  10: 15,
  11: 25,
  12: 10,
  13: 7,
  14: 0,
  15: 230,
  16: 80,
  17: 6,
  18: 110,
  19: 8},
 'sells': {0: 1.0,
  1: 17.0,
  2: 1.0,
  3: 2.0,
  4: 1.0,
  5: 2.0,
  6: 7.0,
  7: 1.0,
  8: 1.0,
  9: 1.0,
  10: 2.0,
  11: 1.0,
  12: 1.0,
  13: 1.0,
  14: 1.0,
  15: 1.0,
  16: 1.0,
  17: 3.0,
  18: 2.0,
  19: 1.0}}
)

and this function that I want to use in a groupby - apply:

import numpy as np

def compute_indicator(df):
  return (
    df.copy()
    .assign(
      indicator=lambda x: x['a'] < np.percentile(x['b'], 80)
    )
    .astype(int)
    .fillna(1)
  )

Where df is meant to be a pandas DataFrame. If I do a group-by apply using pandas, the code executes as expected:

import pandas as pd
# This runs
a = pd.DataFrame.from_dict(x.to_dict()).groupby('store').apply(compute_indicator)

but when trying to run the same on koalas it gives me the following error: ValueError: cannot insert store, already exists

x.groupby('store').apply(compute_indicator)
# ValueError: cannot insert store, already exists

I cannot use a type annotation in compute_indicator because some columns are not fixed (they travel along with the dataframe and are meant to be used by other transformations).

What should I do to run the code in koalas?


Solution

  • As of Koalas 0.29.0, when koalas.DataFrame.groupby(keys).apply(f) runs for the first time over an untyped function f, it has to infer the schema, and to do this it runs pandas.DataFrame.head(n).groupby(keys).apply(f). The problem is that pandas apply receives as its argument the dataframe with the groupby keys both as index and as columns (see this issue).

    The result of pandas.DataFrame.head(n).groupby(keys).apply(f) is then converted to a koalas.DataFrame, so if f doesn't drop the key columns, this conversion raises an exception because of duplicated column names (see issue).
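
Following the explanation above, one possible workaround is to make the applied function drop the key columns before it returns. The sketch below is only an illustration of that idea: drop_group_keys and add_flag are hypothetical names that do not appear in the question, and the example calls the wrapped function directly on a small pandas frame instead of going through Koalas, so it does not show that the error disappears on any particular Koalas version.

```python
import functools

import pandas as pd


def drop_group_keys(func, keys):
    """Wrap func so the frame it returns no longer carries the group-by key columns.

    Hypothetical helper, not part of pandas or Koalas.
    """
    @functools.wraps(func)
    def wrapper(group):
        result = func(group)
        # errors="ignore" keeps this safe when a key column is already absent
        return result.drop(columns=list(keys), errors="ignore")

    return wrapper


def add_flag(group):
    # Hypothetical stand-in for compute_indicator: flags rows whose stock
    # is above the mean stock of the frame it receives.
    return group.assign(flag=group["stock"] > group["stock"].mean())


sample = pd.DataFrame({"store": [81, 81, 128], "stock": [1, 3, 9]})
wrapped = drop_group_keys(add_flag, ["store"])
out = wrapped(sample)
print(list(out.columns))
```

Under the assumption that the duplicated key column is the only cause of the error, the wrapped function could then be passed to the group-by, for example x.groupby('store').apply(drop_group_keys(compute_indicator, ['store'])).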