
Merging on columns with dask


I have a simple script written with pandas that I want to convert to dask dataframes.
The script merges two dataframes on user-specified columns:

import pandas as pd

def merge_dfs(df1, df2, columns):
    merged = pd.merge(df1, df2, on=columns, how='inner')
...

How can I change this line so that it works with dask dataframes?


Solution

  • The dask merge follows the pandas syntax, so you only need to replace the call to pandas with a call to dask.dataframe:

    import dask.dataframe as dd
    
    def merge_dfs(df1, df2, columns):
        merged = dd.merge(df1, df2, on=columns, how='inner')
    # ...
    

    The result, merged, is a lazy dask dataframe: the merge is not executed until a computation is requested. Writing the data to a file, e.g. with .to_csv or .to_parquet, triggers the computation automatically.

    If you need the result for further computation and the data fits into memory, calling .compute() returns a pandas dataframe:

    pandas_df = merged.compute()