python matplotlib scatter-plot vector-graphics

Reduce size on disk of vectorized scatter plot with many overlapping points and alpha


When I plot a scatter plot in matplotlib and save it to a vector format (PDF in this case), the size of the generated file scales with the number of points.

Since I have many points with a large amount of overlap, I set alpha=.2 to show how densely the points are distributed. In the central regions, the displayed color ends up looking the same as alpha=1.

Is there any way to "crop" these regions (e.g. by combining overlapping points within a specified distance) when saving the figure to a vector file, so that some kind of area is saved instead of every single point?

What I forgot to mention: since I need to plot the correlations of multiple variables, I need an (n x n) scatter plot matrix, where n is the number of variables. This makes hexbin and similar methods awkward, since I would have to build the full grid of plots myself.

For example as in:

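import numpy as np
import matplotlib.pyplot as plt

fig_sc = plt.figure(figsize=(5, 5))
ax_sc = fig_sc.gca()
# 100000 semi-transparent open circles; each one is written to the PDF
ax_sc.scatter(
    np.random.normal(size=100000),
    np.random.normal(size=100000),
    s=10, marker='o', facecolors='none', edgecolors='black', alpha=.3)
fig_sc.savefig('test.pdf', format='pdf')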

This results in a file size of approximately 1.5MB, since each point is saved. Can I somehow "reduce" this image by combining overlapping points?

I tried several options such as setting dpi=300 and transparent=False, but since PDF stores the figure as vector graphics, this naturally didn't change anything.

Things that might work, but have drawbacks:

  • hexbin plots: works for a single scatter plot if the resolution and cmap are adjusted correctly, but I want to plot a scatter matrix with (n x n) scatter plots. As far as I know, there is no hexbin-matrix plot.
  • saving to a rasterized format: the plots are for a journal which requests vector graphics wherever possible, so I'd like to avoid storing the image as a raster image.
  • randomly extracting parts of the data: might work, but will alter the appearance of the plots.

Any ideas?
Thanks in advance!


Solution

  • Maybe you want to change your approach and use something other than a scatter plot, leaving the task of downsampling your data set to NumPy and Matplotlib. In other words, use NumPy's histogram2d and Matplotlib's imshow:

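    import numpy as np
    import matplotlib.pyplot as plt

    x, y = [np.random.normal(size=100000) for _ in range(2)]
    h, xedge, yedge = np.histogram2d(x, y, bins=25)
    cmap = plt.get_cmap('Greys')
    # histogram2d puts x along the first axis, so transpose for imshow
    plt.imshow(h.T, interpolation='lanczos', origin='lower', cmap=cmap,
               extent=[xedge[0], xedge[-1], yedge[0], yedge[-1]])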
    


    plt.savefig('Figure1.pdf') # → 30384 bytes
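To check the effect on your own data, you can compare the two approaches without writing files to disk. This is only a rough sketch: the helper name pdf_size, the 20000-point sample, and the Agg backend are my own choices for illustration, and the exact byte counts will depend on your Matplotlib version.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np


def pdf_size(fig):
    """Return the size in bytes of the figure when saved as an in-memory PDF."""
    buf = io.BytesIO()
    fig.savefig(buf, format="pdf")
    return buf.getbuffer().nbytes


rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 20000))

# Variant 1: plain scatter plot, one marker per point in the PDF
fig_scatter, ax = plt.subplots(figsize=(5, 5))
ax.scatter(x, y, s=10, marker="o", facecolors="none",
           edgecolors="black", alpha=0.3)

# Variant 2: 2D histogram drawn as a single image
fig_hist, ax = plt.subplots(figsize=(5, 5))
h, xedge, yedge = np.histogram2d(x, y, bins=25)
ax.imshow(h.T, interpolation="lanczos", origin="lower", cmap="Greys",
          extent=[xedge[0], xedge[-1], yedge[0], yedge[-1]])

scatter_bytes = pdf_size(fig_scatter)
hist_bytes = pdf_size(fig_hist)
print(f"scatter: {scatter_bytes} bytes, histogram: {hist_bytes} bytes")
plt.close("all")
```

The histogram file should stay roughly the same size no matter how many points go in, because only the binned counts are drawn, while the scatter file keeps growing with the sample size.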
    

    Grid arrangement (this time using hexbin)

    np.random.seed(20190308)
    fig, axes = plt.subplots(3, 2, figsize=(4, 6),
                             subplot_kw={'xticks': [], 'yticks': []})
    fig.subplots_adjust(hspace=0.05, wspace=0.05)

    for ax in axes.flat:
        ax.hexbin(*(np.random.normal(size=10000) for _ in ('x', 'y')), cmap=cmap)
    

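The same grid idea can be extended to the (n x n) matrix asked for in the question. The following is a minimal sketch, not a built-in Matplotlib feature: hexbin_matrix is a hypothetical helper I made up for this answer, and drawing a plain histogram on the diagonal is an assumption about what you would want there.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np


def hexbin_matrix(data, labels, gridsize=30, cmap="Greys"):
    """Draw an n x n grid: hexbin off the diagonal, 1D histogram on it.

    data has shape (n_samples, n_variables); labels has one name per variable.
    """
    n = data.shape[1]
    fig, axes = plt.subplots(n, n, figsize=(2 * n, 2 * n), squeeze=False)
    for i in range(n):          # row index -> variable on the y axis
        for j in range(n):      # column index -> variable on the x axis
            ax = axes[i, j]
            if i == j:
                ax.hist(data[:, i], bins=gridsize, color="grey")
            else:
                ax.hexbin(data[:, j], data[:, i], gridsize=gridsize, cmap=cmap)
            # Only the outer panels keep their labels and ticks
            if i == n - 1:
                ax.set_xlabel(labels[j])
            else:
                ax.set_xticks([])
            if j == 0:
                ax.set_ylabel(labels[i])
            else:
                ax.set_yticks([])
    fig.subplots_adjust(hspace=0.05, wspace=0.05)
    return fig, axes


rng = np.random.default_rng(20190308)
samples = rng.normal(size=(10000, 3))
fig, axes = hexbin_matrix(samples, ["a", "b", "c"])
```

Calling fig.savefig with a PDF file name afterwards writes the whole matrix as vector graphics. Each panel then stores at most one polygon per non-empty hexagon rather than one marker per data point, so the file size is governed by gridsize and n, not by the number of samples.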