When plotting scatter plots in matplotlib and saving to a vector format, in this case PDF, the generated file size scales with the number of points.
Since I have a lot of points with a large amount of overlap, I set alpha=.2 to see how densely distributed the points are. In central regions, this results in a displayed color that looks the same as alpha=1.
Is there any way to "crop" these regions (e.g. by combining overlapping points within a specified distance) when saving the figure to a vector file, so that some kind of area is saved instead of every single point?
What I forgot to mention: since I need to plot the correlations of multiple variables, I need an (n x n) scatter plot matrix, where n is the number of variables. This impedes the use of hexbin or other methods, since I'd have to create the full grid of plots myself.
For example as in:
import numpy as np
import matplotlib.pyplot as plt

fig_sc = plt.figure(figsize=(5, 5))
ax_sc = fig_sc.gca()
ax_sc.scatter(
    np.random.normal(size=100000),
    np.random.normal(size=100000),
    s=10, marker='o', facecolors='none', edgecolors='black', alpha=.3)
fig_sc.savefig('test.pdf', format='pdf')
This results in a file size of approximately 1.5 MB, since every point is saved individually. Can I somehow "reduce" this image by combining overlapping points?
I tried several options such as setting dpi=300 and transparent=False, but since PDF stores the figure as a vector image, this naturally didn't change anything.
Things that might work, but have drawbacks:
Any ideas?
Thanks in advance!
Maybe you want to change your approach and use something other than a scatter plot, leaving the task of downsampling your data set to NumPy and Matplotlib. In other words, use NumPy's histogram2d and Matplotlib's imshow:
import numpy as np
import matplotlib.pyplot as plt

x, y = [np.random.normal(size=100000) for _ in range(2)]
h, xedge, yedge = np.histogram2d(x, y, bins=25)
cmap = plt.get_cmap('Greys')
# histogram2d puts x along the first axis, so transpose for imshow
plt.imshow(h.T, interpolation='lanczos', origin='lower', cmap=cmap,
           extent=[xedge[0], xedge[-1], yedge[0], yedge[-1]])
plt.savefig('Figure1.pdf')  # → 30384 bytes in my run
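If you want to check the effect on file size yourself, here is a minimal sketch that saves a scatter plot and a histogram image of the same data into memory and compares the PDF sizes. The helper names density_image and pdf_size are made up for this example, the sample is reduced to 20000 points to keep it quick, and the Agg backend is only selected so that it runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np


def density_image(ax, x, y, bins=25, cmap="Greys"):
    """Hypothetical helper: draw a 2D histogram of (x, y) on ax as one image."""
    h, xedge, yedge = np.histogram2d(x, y, bins=bins)
    # histogram2d puts x along the first axis; imshow expects rows to be y
    im = ax.imshow(h.T, origin="lower", cmap=cmap, aspect="auto",
                   extent=[xedge[0], xedge[-1], yedge[0], yedge[-1]])
    return h, im


def pdf_size(fig):
    """Hypothetical helper: size in bytes of fig saved as PDF into memory."""
    buf = io.BytesIO()
    fig.savefig(buf, format="pdf")
    return len(buf.getvalue())


rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 20000))

# Scatter plot: every point ends up in the PDF
fig_scatter, ax_scatter = plt.subplots(figsize=(5, 5))
ax_scatter.scatter(x, y, s=10, facecolors="none", edgecolors="black", alpha=.3)

# Histogram image: only a bins x bins image ends up in the PDF
fig_hist, ax_hist = plt.subplots(figsize=(5, 5))
h, im = density_image(ax_hist, x, y)

size_scatter = pdf_size(fig_scatter)
size_hist = pdf_size(fig_hist)
print(size_scatter, size_hist)
```

The exact numbers depend on the Matplotlib version and the embedded fonts, so treat them as a rough comparison only. The image size is set by bins, not by the number of points.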
Grid arrangement (this time using hexbin):
np.random.seed(20190308)
fig, axes = plt.subplots(3, 2, figsize=(4, 6),
                         subplot_kw={'xticks': [], 'yticks': []})
fig.subplots_adjust(hspace=0.05, wspace=0.05)
for ax in axes.flat:
    ax.hexbin(*(np.random.normal(size=10000) for _ in ('x', 'y')), cmap=cmap)
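For the (n x n) matrix asked about in the question, the same grid idea could be wrapped in a small function. This is only a sketch under my own assumptions: the function name hexbin_matrix, the (n_samples, n_variables) array layout, the plain histograms on the diagonal, and the synthetic correlated data are all choices made for illustration, not something from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np


def hexbin_matrix(data, gridsize=30, cmap="Greys"):
    """Hypothetical helper: n x n grid for data of shape (n_samples, n_variables).

    Off-diagonal panels show a hexbin of variable j (x) against variable i (y);
    diagonal panels show a 1D histogram of variable i.
    """
    n = data.shape[1]
    fig, axes = plt.subplots(n, n, figsize=(2 * n, 2 * n), squeeze=False,
                             subplot_kw={"xticks": [], "yticks": []})
    fig.subplots_adjust(hspace=0.05, wspace=0.05)
    for i in range(n):
        for j in range(n):
            ax = axes[i, j]
            if i == j:
                ax.hist(data[:, i], bins=gridsize, color="grey")
            else:
                ax.hexbin(data[:, j], data[:, i], gridsize=gridsize, cmap=cmap)
    return fig, axes


# Synthetic example data: three variables, the third correlated with the first
rng = np.random.default_rng(20190308)
data = rng.normal(size=(10000, 3))
data[:, 2] += 0.8 * data[:, 0]

fig, axes = hexbin_matrix(data)
```

Calling fig.savefig with a PDF file name afterwards stores one hexagon collection per off-diagonal panel, so the file size depends on gridsize and n rather than on the number of samples.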