How to make this calculation faster?

So I have an issue that I will explain: I have a set of images that come from an experiment, each pixel of these images has 3 values associated which depend on the position (in case you wonder this is a diffraction experiment). What I would need is the intensity of the pixel, but as a function of these three values. The way I'm currently implementing it in python is with a bunch of for loops that find the pixels that have values within a narrow range and summing them together and saving this as a point in my new "space". Here is a basic implementation with some duummy values:

import numpy as np


def MaskApply(HKLvals, h, k, ll, dh, dk, dl):
    ans = (
        np.where(HKLvals[2].flatten() < ll + dl, 1, 0)
        * np.where(HKLvals[2].flatten() > ll, 1, 0)
        * np.where(HKLvals[1].flatten() > k, 1, 0)
        * np.where(HKLvals[1].flatten() < k + dk, 1, 0)
        * np.where(HKLvals[0].flatten() < h + dh, 1, 0)
        * np.where(HKLvals[0].flatten() > h, 1, 0)
    )
    ans = np.reshape(ans, HKLvals[2].shape, order="C")

    return ans


data = np.random.rand(120, 1000, 500)  # this is a collection of 120 images each 1000x500

# This is read by a file in my actual code
HKLvals = np.ones((120, 3, 1000, 500))  # These are some values connected to the images each of the 1000x500 pixels has three values. for the 120 images
H, dh = np.linspace(0, 10, 1000, retstep=True)
K, dk = np.linspace(0, 10, 1000, retstep=True)
L, dl = np.linspace(0, 5, 100, retstep=True)
outData = np.zeros((len(H), len(K)))

for i in range(len(L)):
    l = L[i]
    for j in range(len(H)):
        h = H[j]
        for t in range(len(K)):
            k = K[t]
            maskArr = np.zeros(np.shape(data))
            pixel_number = 0
            for m in range(np.shape(data)[0]):
                mask = MaskApply(HKLvals[m], h, k, l, dh, dk, dl)
                pixel_number += np.count_nonzero(mask)
                maskArr[m] = mask
            maskArr = maskArr.astype(bool)
            outData[j, t] = np.sum(data[maskArr]) / pixel_number
    print(outData)  # of course in reality this is saved in an .h5

This is the gist of my code ATM. I'd love if anybody had any input on this and could point out some efficiency mistakes, or perhaps a different algorithm to do this that maybe I'm not thinking about.

I suppose the easiest way to improve the efficiency would be to stop using Python and convert to something more performing like C++ or Julia. I'm also looking into that, but I'm very not well versed in those so in the meantime it would be great if anybody had any input.

Thanks to all who will read.

EDIT: fixed the code which is now running properly

Solution

Given our conversation in the question comments, I propose the following solution, that addresses a few points mentioned by other people, namely:

Create the masks several times (here you create the masks just once for each image)
flatten and reshape several time (no need to do that, numpy can handle comparisons in multi-dimensional case)
mask combination using appropriate tools
result aggregation (you can just compute the average, no need to keep track of number of pixels selectede)

here is the code:

import numpy as np
from itertools import product


def process_image(image, HKL_values, H, K, L, dh, dk, dl):

    Nh, Nk, Nl = map(len, (H, K, L))

    h_mask = np.array(
        [np.logical_and(
            HKL_values[0] > h,
            HKL_values[0] < h + dh
        ) for h in H]
    )
    k_mask = np.array(
        [np.logical_and(
            HKL_values[1] > k,
            HKL_values[1] < k + dk
        ) for k in K]
    )
    l_mask = np.array(
        [np.logical_and(
            HKL_values[2] > ll,
            HKL_values[2] < ll + dl
        ) for ll in L]
    )

    intensities = np.zeros((Nh, Nk, Nl))
    for h, k, l in product(range(Nh), range(Nk), range(Nl)):
        mask = h_mask[h] & k_mask[k] & l_mask[l]
        intensities[h, k, l] = np.average(image[mask])
    return intensities

# Load data
data = ...  # load images
HKL_vals = ... # load HKL values for those images

# define HKL steps
H, dh = np.linspace(0, 10, 1000, retstep=True)
K, dk = np.linspace(0, 10, 1000, retstep=True)
L, dl = np.linspace(0, 5, 100, retstep=True)

for i in range(len(data)):
    image = data[i]
    vals = HKL_vals[i]

    result = process_image(image, vals, H, K, L, dh, dk, dl)

    # save result for that image

It uses the process image function to create the masks once, and then uses the itertools function product to create all possible combinations of H, K, and L masks (for each of them value steps), selecting those corresponding pixels from the image and averaging their intensity. This produces a 3D-array with indices along H, K, and L steps, each with the average intensity of pixels that correspond to that particular HKL configuration.

Note also that x_mask will be a collection of all masks for the dimension x, and x_mask[i] is the mask corresponding to the i-th step of dimension x

Important note before running

As numpy stores booleans in a byte (8bits), the memory consumption of pre-calculated masks is 8x heavier than what this answer considers. As pointed out by @SamMason, h_mask will occupy ~500MB! If you have RAM limitations (in the end the masks alone will demand ~1.05GB, and you still need essentially 4D images@ 1000x500), you will inevitably have to break down the solution, pre-computing the smaller masks, and then computing the other masks as needed (just the body of adapted process_image:

l_mask = np.array(
    [np.logical_and(
        HKL_values[2] > L[l],
        HKL_values[2] < L[l] + dl
    ) for ll in L]
)

intensities = np.zeros((Nh, Nk, Nl))

for h in range(Nh):
    h_mask = np.logical_and(
        HKL_values[0] > H[h],
        HKL_values[0] < H[h] + dl
    )
    for k in range(Nk):
        k_mask = np.logical_and(
            HKL_values[1] > K[k],
            HKL_values[1] < K[k] + dl
        )
        for l in range(Nl):
            mask = h_mask & k_mask & l_mask[l]
            intensities[h, k, l] = np.average(image[mask])

return intensities

Note that the precomputed masks are used in the inner-most for-loop, which is the one that runs the most times.