python, algorithm, numpy, stream, numerical-methods

Divide a stream into bins with equal counts


Ideally, I want the following without reading the data from the hard disk too many times. The data is big and memory can't hold all of it at once.

  1. Input is a stream x[t] read from the hard disk. The stream contains N numbers.
  2. It is possible to build a histogram of x with m bins.
  3. The m bins are defined by the bin edges e0 < e1 < ... < em. For example, if ei <= x[0] < ei+1, then x[0] belongs to the ith bin.
  4. Find the bin edges that make each bin hold a nearly equal number of elements from the stream. The count in each bin should ideally be within some threshold percent of N/m, since distributing N elements evenly among m bins puts about N/m elements in each.

Current solution:

import numpy as np


def test_data(size):
    x = np.random.normal(0, 0.5, size // 2)
    x = np.hstack([x, np.random.normal(4, 1, size // 2)])
    return x


def bin_edge_as_index(n_bin, fine_hist, fine_n_bin, data_size):
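    # Walk the cumulative counts of the fine histogram and record, for each
    # target i*N/n_bin, the index of the fine bin where that count is first exceeded.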
    cum_sum = np.cumsum(fine_hist)
    bin_id = np.empty((n_bin + 1), dtype=int)

    count_per_bin = data_size * 1.0 / n_bin

    for i in range(1, n_bin):
        bin_id[i] = np.argmax(cum_sum > count_per_bin * i)

    bin_id[0] = 0
    bin_id[n_bin] = fine_n_bin
    return bin_id


def get_bin_count(bin_edge, data):
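    # Count how many elements fall into each half-open interval [bin_edge[i], bin_edge[i+1]).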
    n_bin = bin_edge.shape[0] - 1
    result = np.zeros((n_bin), dtype=int)
    for i in range(n_bin):
        cmp0 = (bin_edge[i] <= data)
        cmp1 = (data < bin_edge[i + 1])
        result[i] = np.sum(cmp0 & cmp1)
    return result


# Test Setting
test_size = 10000
n_bin = 6
fine_n_bin = 2000  # use a big number and hope it works

# Test Data
x = test_data(test_size)

# Fine Histogram
fine_hist, fine_bin_edge = np.histogram(x, fine_n_bin)

# Index of the bins of the fine histogram that contains
# the required bin edges (e_1, e_2, ... e_n)
bin_id = bin_edge_as_index(
    n_bin, fine_hist, fine_n_bin, test_size)

# Find the bin edges
bin_edge = fine_bin_edge[bin_id]
print("bin_edges:")
print(bin_edge)

# Check
bin_count = get_bin_count(bin_edge, x)
print("bin_counts:")
print(bin_count)
print("ideal count per bin:")
print(test_size * 1.0 / n_bin)

Output of program:

bin_edges:
[-1.86507282 -0.22751473  0.2085489   1.30798591  3.57180559  4.40218207
  7.41287669]
bin_counts:
[1656 1675 1668 1663 1660 1677]
ideal count per bin:
1666.6666666666667

Problem:

I can't specify a threshold s and expect the bin counts to be at most s% away from the ideal count per bin.


Solution

  • Assuming that the distribution is not outrageously skewed (like 10000 values between 1.0000001 and 1.0000002 and 10000 others between 9.0000001 and 9.0000002), you can proceed as below.

    Compute a histogram with a sufficient resolution, say K bins, which covers the whole range (hopefully known beforehand). This will take a single pass over the data.
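
    A minimal sketch of that single pass, assuming the stream can be read in fixed-size chunks; the file name, the reader, the dtype and the range lo/hi are placeholders, not part of the original setup:

    import numpy as np

    K = 2000                        # fine-histogram resolution
    lo, hi = -5.0, 10.0             # assumed known range of the data
    edges = np.linspace(lo, hi, K + 1)
    counts = np.zeros(K, dtype=np.int64)

    def stream_chunks(path, chunk_size=100_000):
        # Placeholder reader: yields 1-D float64 arrays from a flat binary file.
        with open(path, "rb") as f:
            while True:
                chunk = np.fromfile(f, dtype=np.float64, count=chunk_size)
                if chunk.size == 0:
                    break
                yield chunk

    # Single pass over the data: only the K counts are kept in memory.
    for chunk in stream_chunks("x.bin"):
        counts += np.histogram(chunk, bins=edges)[0]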

    Then compute the cumulative histogram and, as you go, identify the m+1 quantile edges (where the cumulative counts cross multiples of N/m).
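
    The edge picking itself is what bin_edge_as_index does in the question; a vectorized equivalent (same logic, just a sketch) uses np.searchsorted on the cumulative counts:

    import numpy as np

    def quantile_edges(counts, edges, m):
        # Pick m+1 edges from a fine histogram so each bin holds roughly N/m elements.
        cum = np.cumsum(counts)
        N = cum[-1]
        targets = N / m * np.arange(1, m)                   # N/m, 2N/m, ..., (m-1)*N/m
        idx = np.searchsorted(cum, targets, side="right")   # first fine bin exceeding each target
        return np.concatenate(([edges[0]], edges[idx], [edges[-1]]))

    Called as quantile_edges(fine_hist, fine_bin_edge, n_bin) on the question's data, it should reproduce the bin_edge array computed above.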

    The accuracy that you will get is dictated by the maximum number of elements in a bin of the original histogram.

    For N elements, using a histogram of K bins and assuming some "nonuniformity factor" f (equal to a few units for reasonable distributions), the maximum error will be f·N/K.
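
    As a quick check of that bound on the test data above (here f is estimated as the largest fine-bin count divided by the uniform count N/K, which is one plausible reading of the nonuniformity factor, not a definition from the answer):

    # Run after the question's script, so fine_hist, test_size, fine_n_bin, n_bin exist.
    f = fine_hist.max() / (test_size / fine_n_bin)   # nonuniformity factor estimate
    max_err = f * test_size / fine_n_bin             # == fine_hist.max(), the f*N/K bound
    print("estimated f:", f)
    print("worst-case miscount per edge:", max_err)
    print("relative to ideal N/m:", max_err / (test_size / n_bin))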


    You can improve accuracy if you like by considering m+1 auxiliary histograms which only accumulate the values that fall in the quantile bins of the global histogram. Then you can refine the quantiles to the resolution of these auxiliary histograms.

    This will cost you an extra pass, but the error will be reduced to f·N/(K·K'), using only K and then m·K' bins of histogram space instead of K·K'.
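
    A sketch of that second pass, under the same chunked-reading assumption as before (the helper refine_edges and the parameter K2 = K' are my own framing of the described approach): for each interior quantile edge, an auxiliary K'-bin histogram spanning only the coarse bin that contains it is accumulated, and the edge is then re-located inside that bin.

    import numpy as np

    def refine_edges(coarse_counts, coarse_edges, stream, m, K2=100):
        # First-pass bookkeeping: cumulative counts and the coarse bin holding each quantile.
        cum = np.cumsum(coarse_counts)
        N = cum[-1]
        targets = N / m * np.arange(1, m)
        hit = np.searchsorted(cum, targets, side="right")    # coarse bin containing each edge

        # One K2-bin auxiliary histogram per interior edge, spanning only its coarse bin.
        sub_edges = [np.linspace(coarse_edges[i], coarse_edges[i + 1], K2 + 1) for i in hit]
        sub_counts = [np.zeros(K2, dtype=np.int64) for _ in hit]

        # Second pass over the stream (a fresh iterable of chunks): accumulate only the
        # values that fall inside those coarse bins.
        for chunk in stream:
            for j, se in enumerate(sub_edges):
                inside = chunk[(chunk >= se[0]) & (chunk < se[-1])]
                sub_counts[j] += np.histogram(inside, bins=se)[0]

        # Re-locate each edge at the finer resolution.
        refined = []
        for j, i in enumerate(hit):
            below = cum[i - 1] if i > 0 else 0    # elements to the left of this coarse bin
            need = targets[j] - below             # counts still missing to reach the target
            k = np.searchsorted(np.cumsum(sub_counts[j]), need, side="right")
            refined.append(sub_edges[j][k])
        return np.array([coarse_edges[0], *refined, coarse_edges[-1]])

    Memory use stays at the K coarse counts plus one K'-bin histogram per interior edge, rather than K·K' bins, as described above.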