python, pandas, dask-dataframe

Cannot add a column (pandas `Series`) to a Dask `DataFrame` without introducing `NaN`


I am constructing a Dask DataFrame from a NumPy array, and afterwards I would like to add a column from a pandas Series.

Unfortunately, the resulting DataFrame contains NaN values, and I cannot work out where the error lies.

from dask.dataframe.core import DataFrame as DaskDataFrame
import dask.dataframe as dd
import pandas as pd
import numpy as np

xy = np.random.rand(int(3e6), 2)
c = pd.Series(np.random.choice(['a', 'b', 'c'], int(3e6)), dtype='category')

# alternative 1 -> many values of x and y are NaN
table: DaskDataFrame = dd.from_array(xy, columns=['x', 'y'])
table['c'] = dd.from_pandas(c, npartitions=1)
print(table.compute())

# alternative 2 -> many values of c are NaN
table: DaskDataFrame = dd.from_array(xy, columns=['x', 'y'])
table['c'] = dd.from_pandas(c, npartitions=table.npartitions)
print(table.compute())

Any help is appreciated.


Solution

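  • The NaN values come from a mismatch between how c and xy are split into partitions, so their rows do not line up when the column is assigned. Try creating the Dask DataFrame with dd.from_pandas instead of dd.from_array, so that the table and the new column share the same index and the same partitioning: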

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    
    n = int(3e6)
    xy = np.random.rand(n, 2)
    c = pd.Series(np.random.choice(['a', 'b', 'c'], n), dtype='category')
    
    # Pick the partition count up front (4 is an arbitrary example value).
    table = dd.from_pandas(pd.DataFrame(xy, columns=['x', 'y']), npartitions=4)
    table['c'] = dd.from_pandas(c, npartitions=table.npartitions)
    print(table.compute())
    

    which returns:

                    x         y  c
    0        0.488121  0.568258  b
    1        0.090625  0.459087  b
    2        0.563856  0.193026  a
    3        0.333338  0.220935  c
    4        0.769926  0.195786  a
    ...           ...       ... ..
    2999995  0.241800  0.114924  b
    2999996  0.462755  0.567131  c
    2999997  0.473718  0.481577  b
    2999998  0.424875  0.937403  c
    2999999  0.189081  0.793600  c
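
To see how rows that do not line up can turn into NaN, here is a minimal sketch in plain pandas (no Dask involved). It only illustrates the general rule that column assignment matches rows by index label, not by position. The names `left`, `right`, `misaligned` and `aligned`, and the shifted index 3..8, are made up for this illustration and are not taken from the question.

```python
import numpy as np
import pandas as pd

# Left frame with the default index labels 0..5.
left = pd.DataFrame(np.arange(12.0).reshape(6, 2), columns=['x', 'y'])

# A column of the same length whose labels are (hypothetically) shifted to 3..8.
right = pd.Series(list('abcabc'), index=range(3, 9), dtype='category')

# Assignment aligns on index labels: labels 0, 1 and 2 have no partner in `right`.
misaligned = left.copy()
misaligned['c'] = right
n_missing = int(misaligned['c'].isna().sum())

# With matching labels (0..5 on both sides) every row finds its value.
aligned = left.copy()
aligned['c'] = right.reset_index(drop=True)
n_missing_aligned = int(aligned['c'].isna().sum())

print(n_missing, n_missing_aligned)
```

In the Dask case a similar check, such as counting missing values per column on a small sample before running on the full 3e6 rows, can help confirm that the table and the new column line up as intended.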