Background: I am working on python, I have a lot of data points (in .CSV form) so far what the code I have
An example of the type of dataset is below. (Mine contains around 12000 rows)
The Dataset
A B C D E F G result
1.00 0.85 -0.999 0.27 0.98 0.39 0.80 -0.86
0.89 0.4 -0.6 0.47 0.28 0.29 0.26 0.65
0.65 -1.00 0.26 0.67 -0.88 0.29 0.10 0.50
0.98 -0.98 0.76 0.37 0.68 0.59 0.90 0
0 0.5 0.56 0.27 0.38 0.79 0.48 -0.65
The code :
df = pd.read_csv("result.csv")
df.loc[df.result>0, df.columns[:-1]].T.plot(ls='', marker='o')
Issue : Sometimes if the value is the same the dot mark is at the same place thus it's hard to see the frequency distribution(such as in Column B and C below though they look similar one value has more points.
What I want to do is to plot something like a ridge plot on the current graph (as I drew below )so that the frequency distribution can be seen. I am a novice in this type of data visualization. Kindly guide me on how it could be done
The density
plot type already does pretty much what you want, we just need to superpose it to your data:
>>> data_to_plot = df.loc[df.result>0, df.columns[:-1]]
>>> data_to_plot.plot(kind='density')
This is trivial if you want horizontal subplots, you can simply use the subplots=True
on either plot (and then zip the returned axes with columns to superpose the other plot):
>>> axes = data_to_plot.plot(kind='density', subplots=True, legend=False)
>>> for ax, (colname, series) in zip(axes, data_to_plot.iteritems()):
... ax.plot(series.values, np.zeros_like(series), ls='', marker='o')
... ax.set_ylabel(colname)
However if you want them vertically it’s likely we’ll have to compute the Gaussian densities ourselves. Pandas documentation points to scipy.stats.gaussian_kde. For this we’ll need to know at which points to interpolate the kernel. On your example it looks like [-1..1] is a good interval but of course you can take it from data min/max.
>>> from scipy.stats import gaussian_kde
>>> y = np.arange(-1, 1.01, .01)
>>> ridges = data_to_plot.apply(lambda s: gaussian_kde(s)(y))
>>> ridges
A B C D E F G
0 0.001119 0.271510 0.270048 2.029737e-24 0.163222 2.352981e-15 0.000018
1 0.001247 0.272310 0.272122 4.796826e-24 0.164507 3.959987e-15 0.000021
2 0.001389 0.273071 0.274155 1.125941e-23 0.165765 6.637610e-15 0.000025
3 0.001545 0.273794 0.276145 2.624972e-23 0.166995 1.108083e-14 0.000030
4 0.001717 0.274479 0.278093 6.078288e-23 0.168200 1.842365e-14 0.000036
.. ... ... ... ... ... ... ...
196 0.939109 0.307535 0.314227 3.791151e-02 0.436305 3.153771e-01 0.630121
197 0.932996 0.304793 0.310216 3.100156e-02 0.431472 2.913782e-01 0.615406
198 0.926089 0.302012 0.306172 2.518140e-02 0.426576 2.682819e-01 0.600298
199 0.918401 0.299193 0.302097 2.031681e-02 0.421619 2.461581e-01 0.584834
200 0.909948 0.296337 0.297994 1.628194e-02 0.416607 2.250649e-01 0.569049
[201 rows x 7 columns]
Then simply ploy with zip, as before. There might be some adjustment needed, but this is how it looks like with your sample data. Note the scaling of ridges so they are all on the same scale and fit inside a 0.5-wide space on the plot.
>>> ax = data_to_plot.T.plot(ls='', marker='o')
>>> for n, (colname, ridge) in enumerate(ridges.iteritems()):
... ax.plot(ridge / (-2 * ridges.max().max()) + n, y, color='black')