python, pandas, matplotlib, dask, dask-dataframe

Using Matplotlib with Dask


Let's say we have a pandas dataframe pd and a dask dataframe dd. When I want to plot the pandas one with matplotlib, I can easily do it:

fig, ax = plt.subplots()
ax.bar(pd["series1"], pd["series2"])
fig.savefig(path)

However, when I try to do the same with the dask dataframe, I get type errors such as:

TypeError: Cannot interpret 'string[python]' as a data type

string[python] is just an example; whatever the dtype of dd["series1"] is will appear in the message.

So my question is: what is the proper way to use matplotlib with dask, and is it even a good idea to combine the two libraries?


Solution

  • SultanOrazbayev's answer is still spot on; this answer elaborates on the Datashader option (which hvPlot calls under the hood).

    Don't use Matplotlib, use hvPlot!

    If you wish to plot the data while it's still large, I recommend using hvPlot, as it can natively handle dask dataframes. It also automatically provides interactivity.

    Example

    import numpy as np
    import dask
    import hvplot.dask
    
    # Create Dask DataFrame with normally distributed data
    df = dask.datasets.timeseries()
    df['x'] = df['x'].map_partitions(lambda x: np.random.randn(len(x)))
    df['y'] = df['y'].map_partitions(lambda x: np.random.randn(len(x)))
    
    # Plot
    df.hvplot.scatter(x='x', y='y', rasterize=True)