python pandas dataframe dask dask-dataframe

Merging on columns with dask

I have a simple script currently written with pandas that I want to convert to dask dataframes.
In this script, I am executing a merge on two dataframes on user-specified columns and I am trying to convert it into dask.

def merge_dfs(df1, df2, columns):
    merged = pd.merge(df1, df2, on=columns, how='inner')
...

How can I change this line to match to dask dataframes?

Solution

The dask merge follows pandas syntax, so it's just substituting call to pandas with a call to dask.dataframe:

import dask.dataframe as dd

def merge_dfs(df1, df2, columns):
    merged = dd.merge(df1, df2, on=columns, how='inner')
# ...

The resulting dataframe, merged, will be a dask.dataframe and hence may need computation downstream. This will be done automatically if you are persisting the data to a file, e.g. with .to_csv or with .to_parquet.

If you will need the dataframe for some computation and if the data fits into memory, then calling .compute will create a pandas dataframe:

pandas_df = merged.compute()