
Assign multi-index variable values based on the number of elements in a dataframe that match a selection criteria


I have a large CSV dataset that looks like the following:

id,x,y,z
34295,695.117,74.0177,70.6486
20915,800.784,98.5225,19.3014
30369,870.428,98.742,23.9953
48151,547.681,53.055,174.176
34026,1231.02,73.7678,203.404
34797,782.725,73.9831,218.592
15598,983.502,82.9373,314.081
34076,614.738,86.3301,171.316
20328,889.016,98.9201,13.3068
...

If I consider each of these lines an element, I would like a data structure that divides space into x, y, z ranges (3-D blocks of space) and tells me how many elements fall within a given block.

For instance if I divided into cubes of 100 x 100 x 100:

counts[900][100][100] = 3

because ids 20915, 30369, and 20328 from the CSV excerpt above all fall within the range x = 800-900, y = 0-100, and z = 0-100.
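In other words, each coordinate maps to the upper edge of the block containing it, i.e. the coordinate rounded up to the next multiple of the block size. A minimal sketch of that mapping (the block_key helper is just for illustration):

import math

def block_key(value, size=100):
    # Upper edge of the (lo, hi] block containing value,
    # e.g. 800.784 -> 900, 98.5225 -> 100.
    return math.ceil(value / size) * size

print(block_key(800.784), block_key(98.5225), block_key(19.3014))  # 900 100 100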

The brute-force way to build the full structure is a multi-level dictionary, as follows:

import numpy
import pandas

df = pandas.read_csv("test.csv")

# Bin edges in steps of 100: x up to 1300, y and z up to 1000.
xs = numpy.linspace(0, 1300, 14, endpoint=True)
ys = numpy.linspace(0, 1000, 11, endpoint=True)
zs = numpy.linspace(0, 1000, 11, endpoint=True)

c = {}
for x_index, x_lo in enumerate(xs[:-1]):
    x_hi = xs[x_index + 1]
    c[x_hi] = {}
    for y_index, y_lo in enumerate(ys[:-1]):
        y_hi = ys[y_index + 1]
        c[x_hi][y_hi] = {}
        for z_index, z_lo in enumerate(zs[:-1]):
            z_hi = zs[z_index + 1]
            # Count the rows that fall in the half-open block (lo, hi] on each axis.
            count = df[(df["x"] > x_lo) & (df["x"] <= x_hi)
                       & (df["y"] > y_lo) & (df["y"] <= y_hi)
                       & (df["z"] > z_lo) & (df["z"] <= z_hi)]["id"].count()
            c[x_hi][y_hi][z_hi] = count
            if count > 0:
                print(f"c[{x_hi}][{y_hi}][{z_hi}] = {count}")

This gives the expected output of:

c[600.0][100.0][200.0] = 1
c[700.0][100.0][100.0] = 1
c[700.0][100.0][200.0] = 1
c[800.0][100.0][300.0] = 1
c[900.0][100.0][100.0] = 3
c[1000.0][100.0][400.0] = 1
c[1300.0][100.0][300.0] = 1

But since the actual production CSV file is very large, it is quite slow. Any suggestions for how to make it fast and a little less clunky?


Solution

  • You could use cut and value_counts:

    import numpy as np
    import pandas as pd

    tmp = df[['x', 'y', 'z']]
    # np.arange excludes the stop value, so add one extra bin width
    # to make sure the top edge (1300 here) is included.
    bins = np.arange(0, np.ceil(tmp.max().max() / 100) * 100 + 100, 100)

    tmp.apply(lambda s: pd.cut(s, bins, labels=bins[1:])).value_counts().to_dict()
    

    Output:

    {(900.0, 100.0, 100.0): 3,
     (600.0, 100.0, 200.0): 1,
     (700.0, 100.0, 100.0): 1,
     (700.0, 100.0, 200.0): 1,
     (800.0, 100.0, 300.0): 1,
     (1000.0, 100.0, 400.0): 1,
     (1300.0, 100.0, 300.0): 1}
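
    If you need the nested counts[x][y][z] access pattern from the question, the flat tuple-keyed dict can be rearranged into a dict of dicts. A minimal sketch, reusing tmp and bins from above:

    from collections import defaultdict

    flat = tmp.apply(lambda s: pd.cut(s, bins, labels=bins[1:])).value_counts().to_dict()

    # Rebuild the counts[x][y][z] nesting from the (x, y, z) tuple keys.
    counts = defaultdict(lambda: defaultdict(dict))
    for (x, y, z), n in flat.items():
        counts[x][y][z] = n

    print(counts[900.0][100.0][100.0])  # 3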
    

    Or round each coordinate up to the next multiple of 100 before value_counts:

    (np.ceil(df[['x', 'y', 'z']].div(100))
       .mul(100).astype(int)
       .value_counts(sort=False)
       .to_dict()
    )
    

    Output:

    {(600, 100, 200): 1,
     (700, 100, 100): 1,
     (700, 100, 200): 1,
     (800, 100, 300): 1,
     (900, 100, 100): 3,
     (1000, 100, 400): 1,
     (1300, 100, 300): 1}
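
    If an array indexed by block is acceptable instead of a dict, np.histogramdd bins all three axes in one vectorized call. A sketch under the same 100-unit grid; note that histogramdd uses left-closed [lo, hi) bins (except the last), whereas the code above uses (lo, hi], so points lying exactly on an edge can land one block lower:

    import numpy as np

    edges = [np.arange(0, 1400, 100),  # x edges 0..1300
             np.arange(0, 1100, 100),  # y edges 0..1000
             np.arange(0, 1100, 100)]  # z edges 0..1000

    counts, _ = np.histogramdd(df[['x', 'y', 'z']].to_numpy(), bins=edges)

    # counts[i, j, k] holds the number of points in block (i, j, k);
    # the block with upper edges (900, 100, 100) is index (8, 0, 0).
    print(counts[8, 0, 0])  # 3.0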