python pandas dataframe spark-koalas

Filter index values in a Koalas DataFrame


I am trying to recreate the operation below in Koalas. It works in pandas, but when I try the same in Koalas it throws an error.

Operation tried in Pandas:

import pandas as pd

df = pd.DataFrame({'foo':['a','b','c','d','e'], 'bar':['1', '2', '3','4','5']})
df1 = pd.DataFrame({'foo':['a','b','c'], 'bar':['1', '2', '3']})

ci = [4,32,12,1]

df[df.index.get_level_values(0).isin(ci)]

Output:

  foo bar
1   b   2
4   e   5

Operation tried in Koalas:

import databricks.koalas as ks

df = ks.DataFrame({'foo':['a','b','c','d','e'], 'bar':['1', '2', '3','4','5']})
df1 = ks.DataFrame({'foo':['a','b','c'], 'bar':['1', '2', '3']})

ci = [4,32,12,1]

df[df.index.get_level_values(0).isin(ci)]

Output: PandasNotImplementedError: The method pd.Index.__iter__() is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.


Solution

  • Looks like Index.get_level_values() is using __iter__() behind the scenes, which is not supported in Koalas.

    Couple of thoughts:

    1. Why the need to use get_level_values() at all? df[df.index.isin(ci)] works just as well.

    2. The "proper" way to index with missing labels would be to use .reindex(). It would fill the rows that are missing from the new index with NaNs, which you'll have to drop:

    new_df = df.reindex(index=ci).dropna()
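
    To compare the two suggestions side by side, here is a minimal sketch in plain pandas (not Koalas, so it runs without a Spark session). It assumes Koalas mirrors pandas for these calls, as the answer above suggests. The names by_mask and by_reindex are only illustrative. The two results hold the same rows but in a different order: the boolean mask keeps the original row order, while .reindex() follows the order of ci.

    ```python
    import pandas as pd

    df = pd.DataFrame({'foo': ['a', 'b', 'c', 'd', 'e'],
                       'bar': ['1', '2', '3', '4', '5']})
    ci = [4, 32, 12, 1]

    # Option 1: boolean mask on the index; rows keep their original order.
    by_mask = df[df.index.isin(ci)]

    # Option 2: reindex by the requested labels, then drop the rows that
    # were filled with NaN because their label (32, 12) is not in df.
    by_reindex = df.reindex(index=ci).dropna()

    print(by_mask.index.tolist())     # [1, 4]
    print(by_reindex.index.tolist())  # [4, 1]
    ```

    One caveat with the second option: .dropna() also removes rows that already contained NaN in the original data, so the mask is the safer choice if the frame may have missing values.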