
min() function doesn't work on koalas.DataFrame columns of date types


I created the following dataframe:

import pandas as pd
import databricks.koalas as ks
df = ks.DataFrame(
    {'Date1': pd.date_range('20211101', '20211110', freq='1D'), 
     'Date2': pd.date_range('20201101', '20201110', freq='1D')})
df

Out[0]:

Date1 Date2
0 2021-11-01 2020-11-01
1 2021-11-02 2020-11-02
2 2021-11-03 2020-11-03
3 2021-11-04 2020-11-04
4 2021-11-05 2020-11-05
5 2021-11-06 2020-11-06
6 2021-11-07 2020-11-07
7 2021-11-08 2020-11-08
8 2021-11-09 2020-11-09
9 2021-11-10 2020-11-10

When trying to get the minimum of Date1, I get the correct result:

df.Date1.min()

Out[1]:

Timestamp('2021-11-01 00:00:00')

Also, when taking the minimum value of each row, the correct result is returned:

df.min(axis=1)

Out[2]:

0   2020-11-01
1   2020-11-02
2   2020-11-03
3   2020-11-04
4   2020-11-05
5   2020-11-06
6   2020-11-07
7   2020-11-08
8   2020-11-09
9   2020-11-10
dtype: datetime64[ns]

However, applying the same function column-wise fails:

df.min(axis=0)

Out[3]:

Series([], dtype: float64)

Does anyone know why this is and if there's an elegant way around it?


Solution

  • This was indeed a bug in Koalas, but Koalas has since been merged into PySpark, where it lives on as the pandas API on Spark.

    With Spark 3.2.0 and above, one needs to replace

    import databricks.koalas as ks
    

    with

    import pyspark.pandas as ps
    

    and replace ks.DataFrame with ps.DataFrame. This completely eliminates the issue.
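
    For illustration, a minimal sketch of the migrated example (assuming Spark 3.2.0+, so that pyspark.pandas is available):

    import pandas as pd
    import pyspark.pandas as ps

    # Same data as in the question, built with the pandas API on Spark
    df = ps.DataFrame(
        {'Date1': pd.date_range('20211101', '20211110', freq='1D'),
         'Date2': pd.date_range('20201101', '20201110', freq='1D')})

    # Column-wise minima now work: this should return one Timestamp per
    # column, i.e. 2021-11-01 for Date1 and 2020-11-01 for Date2
    print(df.min(axis=0))

    If upgrading is not an option, a per-column workaround follows from the question itself: since Series.min() works correctly, the column minima can be collected individually, e.g. pd.Series({c: df[c].min() for c in df.columns}).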