I am constructing a Dask DataFrame from a numpy array, and after this I would like to add a column from a pandas Series. Unfortunately the resulting dataframe contains NaN values, and I am not able to understand where the error lies.
from dask.dataframe.core import DataFrame as DaskDataFrame
import dask.dataframe as dd
import pandas as pd
import numpy as np
xy = np.random.rand(int(3e6), 2)
c = pd.Series(np.random.choice(['a', 'b', 'c'], int(3e6)), dtype='category')
# alternative 1: many values of x and y come out as NaN
table: DaskDataFrame = dd.from_array(xy, columns=['x', 'y'])
table['c'] = dd.from_pandas(c, npartitions=1)
print(table.compute())
# alternative 2: many values of c come out as NaN
table: DaskDataFrame = dd.from_array(xy, columns=['x', 'y'])
table['c'] = dd.from_pandas(c, npartitions=table.npartitions)
print(table.compute())
Any help is appreciated.
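The NaN values come from a mismatch between how c and xy are split into partitions. Column assignment aligns rows by index, so when the partitions of the two collections do not line up, rows without a matching partner are filled with NaN. One way around this is to create the Dask DataFrame with dd.from_pandas instead of dd.from_array, so that both collections are built from pandas objects that share the same index: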
import numpy as np
import pandas as pd
import dask.dataframe as dd
n = int(3e6)
xy = np.random.rand(n, 2)
c = pd.Series(np.random.choice(['a', 'b', 'c'], n), dtype='category')
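# `table` does not exist yet at this point, so pass an explicit partition count
# (10 is an arbitrary choice; the Series below reuses table.npartitions)
table = dd.from_pandas(pd.DataFrame(xy, columns=['x', 'y']), npartitions=10)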
table['c'] = dd.from_pandas(c, npartitions=table.npartitions)
print(table.compute())
which returns:
                x         y   c
0        0.488121  0.568258   b
1        0.090625  0.459087   b
2        0.563856  0.193026   a
3        0.333338  0.220935   c
4        0.769926  0.195786   a
...           ...       ...  ..
2999995  0.241800  0.114924   b
2999996  0.462755  0.567131   c
2999997  0.473718  0.481577   b
2999998  0.424875  0.937403   c
2999999  0.189081  0.793600   c