I have a simple script currently written with pandas that I want to convert to dask dataframes.
In this script, I am executing a merge on two dataframes on user-specified columns and I am trying to convert it into dask.
def merge_dfs(df1, df2, columns):
merged = pd.merge(df1, df2, on=columns, how='inner')
...
How can I change this line to match to dask dataframes?
The dask
merge follows pandas
syntax, so it's just substituting call to pandas
with a call to dask.dataframe
:
import dask.dataframe as dd
def merge_dfs(df1, df2, columns):
merged = dd.merge(df1, df2, on=columns, how='inner')
# ...
The resulting dataframe, merged
, will be a dask.dataframe
and hence may need computation downstream. This will be done automatically if you are persisting the data to a file, e.g. with .to_csv
or with .to_parquet
.
If you will need the dataframe for some computation and if the data fits into memory, then calling .compute
will create a pandas
dataframe:
pandas_df = merged.compute()