python pandas dataframe databricks spark-koalas

PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array


I am trying to create a new column in a Koalas DataFrame df. The DataFrame has two columns, col1 and col2. I need a new column, newcol, holding the median of the col1 and col2 values in each row.

import numpy as np
import databricks.koalas as ks

# df is Koalas dataframe
df = df.assign(newcol=lambda x: np.median(x.col1, x.col2).astype(float))

But I get the following error:

PandasNotImplementedError: The method pd.Series.__iter__() is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.

I also tried:

df.newcol = df.apply(lambda x: np.median(x.col1, x.col2), axis=1)

But it didn't work.


Solution

  • I had the same problem. One caveat: I'm using pyspark.pandas instead of Koalas, but my understanding is that pyspark.pandas grew out of Koalas, so my solution might still help. I tried to test it with Koalas but couldn't get a cluster running with a reasonable version.

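    import pyspark.pandas as ps

    data = {"col_1": [1, 2, 3], "col_2": [4, 5, 6]}
    df = ps.DataFrame(data)

    # Row-wise median: with axis=1, each row reaches the lambda as a pandas.Series
    median_series = df[["col_1", "col_2"]].apply(lambda x: x.median(), axis=1)
    median_series.name = "median"

    # Attach the result by joining on the index; how='left' keeps every row of df
    df = ps.merge(df, median_series, left_index=True, right_index=True, how='left')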
    

    With axis=1, apply passes each row to the lambda as a pandas.Series, so I called its median method. Annoyingly, I couldn't get any form of direct assignment to work; the only way I found was this ugly merge. I used how='left' for the peace of mind that df keeps the same number of rows, but 'inner' could be fine depending on context.
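
For the two-column case in the question there may be a simpler route that avoids both apply and merge: the median of exactly two numbers is their midpoint, so ordinary column arithmetic should give the same values. The sketch below uses plain pandas as a stand-in so that it runs without a cluster. That the same expression and assignment carry over unchanged to pyspark.pandas or Koalas is an assumption, not something tested here. The column names and data are the ones from the example above.

```python
import pandas as pd  # assumption: stand-in for pyspark.pandas so this runs locally

df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})

# With exactly two values per row, the median is the midpoint of the pair,
# so column arithmetic reproduces the row-wise median without apply or merge.
df["newcol"] = (df["col_1"] + df["col_2"]) / 2

# Cross-check against an explicit row-wise median (plain pandas API).
expected = df[["col_1", "col_2"]].median(axis=1)
print(df)
```

This only holds for two columns. With three or more, the median is no longer the mean, and a row-wise approach such as the apply call above is still needed.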