Search code examples
pythonnumpyhistogramdata-processingbinning

How to determine a grid that bins the given data into a smooth histogram?


Description

I have some simulation data that should be binned into a histogram (binning with sum).

The original data is shown in the figure below. The blue dots are the data to be binned, the red lines are the logscale grids that define the bin boundaries. correct grid

When correct number of bins (50) is used, the result binned histogram is a smooth curve. correct histogram

However, if I used an incorrect grid size, for example, 58, the result becomes oscillatory incorrect histogram

To understand the reason, look at the incorrect grid below: seems that the grid may split some periodic data in a wrong period, causing the swinging assignment of the data points and the oscillatory data. incorrect grid

Question

Currently, I find the optimal grid by trial and error. I wonder whether there is a easy way to find the grid that will bin the data into a smooth curve (assume always exists) ?

Sample data

The sample data N is uploaded in this gist. The first column is Size, and the second column is Count.

I have created a sample colab notebook to reproduce the plot.

Thanks!


Solution

  • This is still 'trial and error', but at least programmatically. I assume we only want to find out the number of bins.

    Let's check up to a certain amount of bins for the best one. We'll define the best one as the one that minimizes the mean of the absolute differences of the logs of your 'counts', putting a huge penalty on a positive difference (representing a jump in the graph).

    def judge_grid(N, grid, pos_penalty=1e5):
        stat, bin_edges, _ = binned_statistic(N[:, 0],  N[:, 1], statistic="sum", bins=grid)
        logcounts = np.log(stat) - np.log(bin_edges[1:] - bin_edges[:-1])
        d = np.diff(logcounts)
        # Huge penalty for positive difference.
        ad = np.where(d > 0, d * pos_penalty, -d)
        return np.mean(ad)
    
    lo = np.log10(1e-5)
    hi = np.log10(1.0)
    min_bins = 10
    max_bins = 80
    best_num_bins = min(range(min_bins, 1+max_bins),
                        key=lambda b: judge_grid(N, np.logspace(lo, hi, b)))
    print(best_num_bins) 
    

    For your example this returns 50.