Tags: python, matplotlib, scipy, histogram, binning

2D histogram colour by "label fraction" of data in each bin


Following on from the post found here: 2D histogram coloured by standard deviation in each bin

Using Python, I would like to colour each bin in a 2D grid by the fraction of points whose label values are below a certain threshold.

Note that, in this dataset, each point has a continuous label value between 0 and 1.

For example here is a histogram I made whereby the colour denotes the standard deviation of label values of all points in each bin:

[Figure: 2D histogram coloured by the standard deviation of label values in each bin]

This was done using scipy.stats.binned_statistic_2d() (see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic_2d.html) with the statistic argument set to 'std'.

Is there a way to change this kind of plot so that the colour represents the fraction of points in each bin with a label value below, for example, 0.5?

It could be that the only way to do this is to explicitly define a grid of some kind and calculate the fractions, but I'm not sure of the best way to do that, so any help would be greatly appreciated!

Maybe using scipy.stats.binned_statistic_2d or numpy.histogram2d to return the raw data values in each bin as a multidimensional array would make it possible to compute the fractions explicitly and quickly.


Solution

  • The fraction of elements in an array below a threshold can be calculated as

    fraction = lambda a, threshold: len(a[a<threshold])/len(a)
    

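    The statistic argument also accepts a callable that is applied to the 1D array of values in each bin. Hence you can call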

    scipy.stats.binned_statistic_2d(x, y, values, statistic=lambda a: fraction(a, 0.5))
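For completeness, here is a minimal sketch of how the call above could be wrapped up and plotted. The function names (fraction_below, binned_fraction, binned_fraction_fast, plot_fraction) and the bin_range parameter are hypothetical helpers made up for this illustration, not part of scipy. The second function is an alternative I am assuming may be quicker on large datasets: it divides two numpy.histogram2d counts (points below the threshold over all points) instead of calling a Python function per bin. Empty bins come out as NaN in both versions.

```python
import numpy as np
from scipy import stats


def fraction_below(values, threshold=0.5):
    """Fraction of `values` strictly below `threshold` (NaN for an empty bin)."""
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return np.nan
    return np.mean(values < threshold)


def binned_fraction(x, y, labels, threshold=0.5, bins=10, bin_range=None):
    """Per-bin fraction of labels below `threshold`, via a callable statistic."""
    result = stats.binned_statistic_2d(
        x, y, labels,
        statistic=lambda v: fraction_below(v, threshold),
        bins=bins,
        range=bin_range,
    )
    return result.statistic, result.x_edge, result.y_edge


def binned_fraction_fast(x, y, labels, x_edge, y_edge, threshold=0.5):
    """Same quantity as a ratio of two histograms (no per-bin Python call)."""
    is_below = (np.asarray(labels) < threshold).astype(float)
    counts, _, _ = np.histogram2d(x, y, bins=[x_edge, y_edge])
    below, _, _ = np.histogram2d(x, y, bins=[x_edge, y_edge], weights=is_below)
    # 0/0 in empty bins is deliberately left as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return below / counts


def plot_fraction(frac, x_edge, y_edge):
    """Colour each bin by its fraction; matplotlib is imported only here."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # The statistic has shape (nx, ny); pcolormesh expects (ny, nx), hence .T
    mesh = ax.pcolormesh(x_edge, y_edge, frac.T, vmin=0, vmax=1)
    fig.colorbar(mesh, ax=ax, label="fraction of labels below threshold")
    return fig, ax
```

With your own x, y and label arrays, something like frac, xe, ye = binned_fraction(x, y, labels, threshold=0.5, bins=20) followed by plot_fraction(frac, xe, ye) should give a plot analogous to the 'std' one, with the colour scale running from 0 to 1.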