Tags: python, pandas, dataframe, dask, dask-dataframe

Replacing existing column in dask map_partitions gives SettingWithCopyWarning


I'm replacing the column id2 in a dask dataframe using map_partitions. The values are replaced as expected, but pandas emits a warning.

What does this warning mean, and how can I apply the .loc suggestion in the example below?

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'dummy2': [10, 10, 10, 20, 20, 15, 10, 30, 20, 26],
    'id2': [1, 1, 1, 2, 2, 1, 1, 1, 2, 2],
    'balance2': [150, 140, 130, 280, 260, 150, 140, 130, 280, 260]
})

ddf = dd.from_pandas(pdf, npartitions=3) 

def func2(df):
    df['id2'] = df['balance2'] + 1
    return df

ddf = ddf.map_partitions(func2)

ddf.compute()

C:\Users\xxxxxx\AppData\Local\Temp\ipykernel_30076\248155462.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['id2'] = df['balance2'] + 1


Solution

  • A quick fix is to copy the dataframe inside the function:

    def func2(df):
        df = df.copy() # will make a copy of the dataframe
        df['id2'] = df['balance2'] + 1
        return df
    

    However, as far as I understand, the copy is not strictly required: because dask dataframes are evaluated lazily, changes made inside the function are not propagated back to the partitions of the original dask dataframe.

    Update: a related question explains why .copy is used in pandas. In the snippet below, applying the function modifies the original pandas dataframe, which might be undesirable:

    from pandas import DataFrame
    
    def addcol(df):
        df['a'] = 1
        return df
    
    df = DataFrame()
    
    df1 = addcol(df)
    # without .copy, df is also modified, which might be undesirable
    

    In the context of dask, this warning is just that, a warning, so .copy is not needed:

    from dask.dataframe import from_pandas
    ddf = from_pandas(df, npartitions=1)
    ddf1 = ddf.map_partitions(addcol)
    # will show warning, but original ddf is not modified