I created the following dataframe:
import pandas as pd
import databricks.koalas as ks
df = ks.DataFrame(
    {'Date1': pd.date_range('20211101', '20211110', freq='1D'),
     'Date2': pd.date_range('20201101', '20201110', freq='1D')})
df
Out[0]:
        Date1      Date2
0  2021-11-01 2020-11-01
1  2021-11-02 2020-11-02
2  2021-11-03 2020-11-03
3  2021-11-04 2020-11-04
4  2021-11-05 2020-11-05
5  2021-11-06 2020-11-06
6  2021-11-07 2020-11-07
7  2021-11-08 2020-11-08
8  2021-11-09 2020-11-09
9  2021-11-10 2020-11-10
When trying to get the minimum of Date1, I get the correct result:
df.Date1.min()
Out[1]:
Timestamp('2021-11-01 00:00:00')
Also, when trying to get the minimum value of each row, the correct result is returned:
df.min(axis=1)
Out[2]:
0 2020-11-01
1 2020-11-02
2 2020-11-03
3 2020-11-04
4 2020-11-05
5 2020-11-06
6 2020-11-07
7 2020-11-08
8 2020-11-09
9 2020-11-10
dtype: datetime64[ns]
However, applying the same function to the columns (axis=0) fails:
df.min(axis=0)
Out[3]:
Series([], dtype: float64)
Does anyone know why this is and if there's an elegant way around it?
This was indeed a bug in Koalas. Since then, however, Koalas has been merged into PySpark, and the pandas API on Spark was born.
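If you are stuck on a Spark version below 3.2 and therefore on Koalas itself, one possible workaround is to compute the column minima one column at a time. This is only a sketch, relying on the fact shown above that per-column min() does work:

import pandas as pd
import databricks.koalas as ks

df = ks.DataFrame(
    {'Date1': pd.date_range('20211101', '20211110', freq='1D'),
     'Date2': pd.date_range('20201101', '20201110', freq='1D')})

# df.min(axis=0) comes back empty on affected Koalas versions, so call
# min() on each column individually and collect the results in pandas.
col_mins = pd.Series({col: df[col].min() for col in df.columns})
print(col_mins)
# Date1   2021-11-01
# Date2   2020-11-01
# dtype: datetime64[ns]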
Using Spark 3.2.0 and above, one needs to replace

import databricks.koalas as ks

with

import pyspark.pandas as ps

and replace ks.DataFrame with ps.DataFrame. This completely eliminates the issue.
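For illustration, here is the question's example rewritten against the pandas API on Spark. This is a minimal sketch: it assumes a Spark 3.2+ environment in which pyspark.pandas is importable, and the output shown in the comments is what the fix described above should produce.

import pandas as pd
import pyspark.pandas as ps

df = ps.DataFrame(
    {'Date1': pd.date_range('20211101', '20211110', freq='1D'),
     'Date2': pd.date_range('20201101', '20201110', freq='1D')})

# Column-wise minima now work instead of returning an empty float64 Series.
print(df.min(axis=0))
# Expected:
# Date1   2021-11-01
# Date2   2020-11-01
# dtype: datetime64[ns]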