I am trying to create a new column in a Koalas dataframe df. The dataframe has 2 columns: col1 and col2. I need to create a new column newcol as the median of the col1 and col2 values.
import numpy as np
import databricks.koalas as ks
# df is Koalas dataframe
df = df.assign(newcol=lambda x: np.median(x.col1, x.col2).astype(float))
But I get the following error:
PandasNotImplementedError: The method pd.Series.__iter__() is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Also I tried:
df.newcol = df.apply(lambda x: np.median(x.col1, x.col2), axis=1)
But it didn't work.
I had the same problem. One caveat: I'm using pyspark.pandas instead of Koalas, but my understanding is that pyspark.pandas came from Koalas, so my solution might still help. I tried to test it with Koalas but was unable to run a cluster with a reasonable version.
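import pyspark.pandas as ps
data = {"col_1": [1,2,3], "col_2": [4,5,6]}
df = ps.DataFrame(data)
# Row-wise median: with axis=1 the lambda receives each row as a pandas.Series
median_series = df[["col_1","col_2"]].apply(lambda x: x.median(), axis=1)
# The series name becomes the column name after the merge
median_series.name = "median"
# Join on the index to attach the median as a new column
df = ps.merge(df, median_series, left_index=True, right_index=True, how='left')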
With apply and axis=1, the lambda parameter x is a pandas.Series holding one row, so I used its median method. Annoyingly, I couldn't get any form of assignment to work; the only way I found was this ugly merge. I used how='left' for the peace of mind that df would keep the same number of rows, but inner could be fine depending on context.
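As a side note for the two-column case specifically: the median of two numbers is just their mean, so plain column arithmetic should give the same values without apply or merge. Below is a minimal sketch in plain pandas, used here only so that it runs without a Spark cluster. It is an assumption on my part, not something I tested, that the same expression and item assignment behave identically on a pyspark.pandas or Koalas dataframe.

```python
import pandas as pd

# Same sample data as in the snippet above (plain pandas stand-in).
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})

# The median of exactly two values equals their average,
# so no row-wise apply is needed for this special case.
df["median"] = (df["col_1"] + df["col_2"]) / 2

# Row-wise median computed by pandas itself, for comparison.
row_median = df[["col_1", "col_2"]].median(axis=1)

print(df)
print(row_median)
```

This shortcut only holds for exactly two columns; with three or more, a genuine row-wise median like the apply above is still needed.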