python, pandas, apache-spark, pyspark, spark-koalas

Adding a new column to an existing Koalas DataFrame results in NaNs


I am trying to add a new column to an existing Koalas DataFrame, but the values turn into NaN as soon as the column is added. I am not sure what is going on here. Could anyone give me some pointers?

Here's the code:

import databricks.koalas as ks
import numpy as np

kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

ks.set_option('compute.ops_on_diff_frames', True)
ks_series = ks.Series((np.arange(len(kdf.to_numpy()))))
kdf["values"] = ks_series

ks.reset_option('compute.ops_on_diff_frames')

Solution

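  • You need to match the index when adding a new column. A ks.Series created without an index argument gets the default index 0 to 5, while kdf is indexed 10 to 60. The assignment aligns rows by index label, no labels match, and every row ends up as NaN. Pass the DataFrame's index to the Series: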

    import databricks.koalas as ks
    import numpy as np
    
    kdf = ks.DataFrame(
        {'a': [1, 2, 3, 4, 5, 6],
         'b': [100, 200, 300, 400, 500, 600],
         'c': ["one", "two", "three", "four", "five", "six"]},
        index=[10, 20, 30, 40, 50, 60])
    
    ks.set_option('compute.ops_on_diff_frames', True)
    ks_series = ks.Series(np.arange(len(kdf.to_numpy())), index=kdf.index.tolist())
    kdf["values"] = ks_series
    
    kdf
        a    b      c  values
    10  1  100    one       0
    20  2  200    two       1
    30  3  300  three       2
    40  4  400   four       3
    50  5  500   five       4
    60  6  600    six       5
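
To see why the index matters, here is a minimal sketch in plain pandas. It assumes that Koalas follows pandas' index-alignment behavior for column assignment, which is what the result above suggests. The names pdf, mismatched, and matched are illustrative and not part of the original post.

```python
import pandas as pd

# A small frame with a non-default index, like kdf in the question.
pdf = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [100, 200, 300]},
    index=[10, 20, 30])

# This Series gets the default index 0, 1, 2.
# None of those labels exist in pdf, so the new column is all NaN.
mismatched = pd.Series([0, 1, 2])
pdf['bad'] = mismatched

# Reusing the frame's own index makes every label line up.
matched = pd.Series([0, 1, 2], index=pdf.index)
pdf['good'] = matched

print(pdf)
```

The 'bad' column comes out as NaN in every row, while 'good' holds the intended values, because assignment matches rows by index label rather than by position.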