Tags: python, pandas, plot, data-visualization, data-virtualization

Data visualisation using ridge and scatter plots


Background: I am working in Python and have a lot of data points (in a .csv file). So far, my code:

  1. Reads the CSV, including the "result" column.
  2. If the value in the "result" column is positive, plots the corresponding A B C D E F G parameters, with the parameter value on the y-axis and the parameter name on the x-axis.
  3. If there are more than 10 such "result" values, plots only the first 10 sets of A B C D E F G parameters.

An example of the dataset is below (mine contains around 12000 rows).

The Dataset


  A     B       C     D       E     F    G    result
1.00   0.85  -0.999  0.27   0.98  0.39  0.80  -0.86
0.89   0.4   -0.6    0.47   0.28  0.29  0.26   0.65
0.65  -1.00   0.26   0.67  -0.88  0.29  0.10   0.50
0.98  -0.98   0.76   0.37   0.68  0.59  0.90      0
   0   0.5    0.56   0.27   0.38  0.79  0.48  -0.65 

The code:

import pandas as pd

df = pd.read_csv("result.csv")
df.loc[df.result>0, df.columns[:-1]].T.plot(ls='', marker='o')
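
A self-contained variant of the snippet above might look like the following. It is only a sketch: the five sample rows from the question are inlined as a string (an assumption made so that it runs without result.csv), the name `params` is introduced here for illustration, and `.head(10)` is one possible way to apply the 10-row limit from step 3.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import pandas as pd

# The five sample rows from the question, inlined so the sketch is self-contained.
csv_text = """A,B,C,D,E,F,G,result
1.00,0.85,-0.999,0.27,0.98,0.39,0.80,-0.86
0.89,0.4,-0.6,0.47,0.28,0.29,0.26,0.65
0.65,-1.00,0.26,0.67,-0.88,0.29,0.10,0.50
0.98,-0.98,0.76,0.37,0.68,0.59,0.90,0
0,0.5,0.56,0.27,0.38,0.79,0.48,-0.65
"""
df = pd.read_csv(io.StringIO(csv_text))

# Steps 1-2: keep rows with a positive result and drop the result column.
# Step 3: keep at most the first 10 of those rows.
params = df.loc[df["result"] > 0, df.columns[:-1]].head(10)

# Transpose so parameter names run along the x-axis and values along the y-axis.
ax = params.T.plot(ls="", marker="o")
ax.set_xlabel("parameter")
ax.set_ylabel("value")
```

With the sample rows, only the second and third rows have a positive result, so two sets of markers are drawn.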

Issue: When several values are identical, their markers are drawn at the same position, so it is hard to see the frequency distribution (for example in columns B and C below: they look similar, but one value has more points).

What I want is to overlay something like a ridge plot on the current graph (as I drew below) so that the frequency distribution can be seen. I am a novice in this type of data visualization. Kindly guide me on how this could be done.

[Image: hand-drawn sketch of the desired plot]


Solution

  • The density plot type already does pretty much what you want; we just need to overlay it on your data:

    >>> data_to_plot = df.loc[df.result>0, df.columns[:-1]]
    >>> data_to_plot.plot(kind='density')
    

    basic KDE plot of data

    This is trivial if you want horizontal subplots: pass subplots=True to either plot call, then zip the returned axes with the columns to overlay the other plot:

    >>> import numpy as np
    >>> axes = data_to_plot.plot(kind='density', subplots=True, legend=False)
    >>> # .items() replaces .iteritems(), which was removed in pandas 2.0
    >>> for ax, (colname, series) in zip(axes, data_to_plot.items()):
    ...     ax.plot(series.values, np.zeros_like(series), ls='', marker='o')
    ...     ax.set_ylabel(colname)
    


    However, if you want them vertical, we will likely have to compute the Gaussian densities ourselves. The pandas documentation points to scipy.stats.gaussian_kde. For this we need to choose the points at which to evaluate the density estimate. In your example [-1, 1] looks like a good interval, but you can of course take it from the data min/max.

    >>> from scipy.stats import gaussian_kde
    >>> y = np.arange(-1, 1.01, .01)
    >>> ridges = data_to_plot.apply(lambda s: gaussian_kde(s)(y))
    >>> ridges
                A         B         C             D         E             F         G
    0    0.001119  0.271510  0.270048  2.029737e-24  0.163222  2.352981e-15  0.000018
    1    0.001247  0.272310  0.272122  4.796826e-24  0.164507  3.959987e-15  0.000021
    2    0.001389  0.273071  0.274155  1.125941e-23  0.165765  6.637610e-15  0.000025
    3    0.001545  0.273794  0.276145  2.624972e-23  0.166995  1.108083e-14  0.000030
    4    0.001717  0.274479  0.278093  6.078288e-23  0.168200  1.842365e-14  0.000036
    ..        ...       ...       ...           ...       ...           ...       ...
    196  0.939109  0.307535  0.314227  3.791151e-02  0.436305  3.153771e-01  0.630121
    197  0.932996  0.304793  0.310216  3.100156e-02  0.431472  2.913782e-01  0.615406
    198  0.926089  0.302012  0.306172  2.518140e-02  0.426576  2.682819e-01  0.600298
    199  0.918401  0.299193  0.302097  2.031681e-02  0.421619  2.461581e-01  0.584834
    200  0.909948  0.296337  0.297994  1.628194e-02  0.416607  2.250649e-01  0.569049
    
    [201 rows x 7 columns]
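
As a variant of the fixed grid above, the evaluation points can be derived from the data's min/max. The following is a minimal sketch, not the code that produced the output shown: the random `data_to_plot` is a hypothetical stand-in, the 5% padding is an arbitrary choice, and the dict comprehension is used in place of `apply` only to make the construction of the result explicit. It assumes each column has at least two distinct values, since gaussian_kde cannot estimate a density from a constant column.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Hypothetical stand-in for data_to_plot: 50 random values per column in [-1, 1].
rng = np.random.default_rng(0)
data_to_plot = pd.DataFrame(rng.uniform(-1, 1, size=(50, 3)), columns=list("ABC"))

# Evaluation grid spanning the observed range, padded by 5% on each side
# (the padding amount is an assumption, chosen only for illustration).
lo, hi = data_to_plot.min().min(), data_to_plot.max().max()
pad = 0.05 * (hi - lo)
y = np.linspace(lo - pad, hi + pad, 201)

# One density curve per column, evaluated on the shared grid.
ridges = pd.DataFrame(
    {colname: gaussian_kde(series)(y) for colname, series in data_to_plot.items()}
)
```

Using one shared grid for all columns keeps the ridges directly comparable when they are drawn side by side.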
    

    Then simply plot with zip, as before. Some adjustment might be needed, but this is what it looks like with your sample data. Note that the ridges are all divided by the same factor, so they share one scale and fit inside a 0.5-wide space on the plot.

    >>> ax = data_to_plot.T.plot(ls='', marker='o')
    >>> for n, (colname, ridge) in enumerate(ridges.items()):
    ...     ax.plot(ridge / (-2 * ridges.max().max()) + n, y, color='black')
    

    vertical ridges with points
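
    Put together, the vertical version could be sketched as a single script. This is an illustration under stated assumptions rather than the exact code behind the figure: the random `data_to_plot` is hypothetical stand-in data (each column needs at least two distinct values for gaussian_kde to work), `scale` is a name introduced here, and the axis limits are an arbitrary cosmetic choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Hypothetical stand-in data: 10 rows of parameter values clipped to [-1, 1].
rng = np.random.default_rng(42)
cols = list("ABCDEFG")
data_to_plot = pd.DataFrame(
    np.clip(rng.normal(0.3, 0.4, size=(10, len(cols))), -1, 1), columns=cols
)

# Evaluate one KDE per column on a shared grid over [-1, 1].
y = np.linspace(-1, 1, 201)
ridges = pd.DataFrame(
    {colname: gaussian_kde(series)(y) for colname, series in data_to_plot.items()}
)

# Common factor so that the tallest ridge is exactly 0.5 wide on the x-axis.
scale = 0.5 / ridges.max().max()

# Scatter of the raw values: column n of the data sits at x position n.
ax = data_to_plot.T.plot(ls="", marker="o", legend=False)

# Draw each ridge to the left of its column, with the density along the y-axis.
for n, (colname, ridge) in enumerate(ridges.items()):
    ax.plot(n - ridge * scale, y, color="black")

ax.set_xlim(-0.6, len(cols) - 0.5)
```

    Dividing by the global maximum rather than each column's own maximum keeps the ridge widths comparable across parameters, which is what makes differences in frequency visible.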