import numpy as np
ts = np.random.rand(40,45,40,1000)
mask = np.random.randint(2, size=(40,45,40),dtype=bool)
#creating a masked array
ts_m = np.ma.array(ts, mask=ts*~mask[:,:,:,np.newaxis])
#demeaning
ts_md = ts_m - ts_m.mean(axis=3)[:,:,:,np.newaxis]
#standardisation
ts_mds = ts_md / ts_md.std(ddof=1,axis=3)[:,:,:,np.newaxis]
I would like to demean ts (along axis 3), and divide by its standard deviation (along axis 3), all within the mask.
Am I doing this correctly?
Is there a faster method?
You have a few options available to you.
The first is to use masked arrays as you are doing, but provide a proper mask and use the masked functions. Right now, your code computes all the means and standard deviations and then slaps a mask on the result. To skip masked elements, use np.ma.mean and np.ma.std, and thereby avoid doing a whole lot of extra work.
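A minimal sketch of the difference (toy values, just to illustrate): np.ma.mean skips masked entries, while the plain mean of the underlying data does not.

```python
import numpy as np

# One masked outlier: the masked mean ignores it, the raw mean does not.
a = np.ma.array([1.0, 2.0, 100.0], mask=[False, False, True])
print(np.ma.mean(a))    # 1.5 -- only the unmasked 1.0 and 2.0 count
print(np.mean(a.data))  # ~34.33 -- the raw buffer still contains 100.0
```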
As you correctly understood, the size of the mask must match that of the data. Multiplying by the data does give you the correct size, but it is expensive and gives the wrong result in the general case, since the product is zero wherever either the data or the mask is zero. A better approach is to create a view of the mask repeated along the last (new) dimension. You can use np.broadcast_to if you first make the trailing dimensions match up:
ts = np.random.rand(40, 45, 40, 1000)
mask = np.random.randint(2, size=(40, 45, 40), dtype=bool)
# creating a masked array
ts_m = np.ma.array(ts, mask=np.broadcast_to(mask[..., None], ts.shape))
# demeaning
ts_md = ts_m - np.ma.mean(ts_m, axis=3)[..., None]
# standardisation
ts_mds = ts_md / np.ma.std(ts_m, ddof=1, axis=3)[..., None]
The broadcast mask is read-only, and because it has a dimension with zero stride, it can sometimes do unexpected things. The broadcast version here is roughly equivalent to
np.lib.stride_tricks.as_strided(mask, ts.shape, (*mask.strides, 0), writeable=False)
Both versions create views of the original data, so they are very fast: they only allocate a new array object that points to the existing buffer, which is not copied. Keep in mind that np.lib.stride_tricks.as_strided
is a sledgehammer that should be used with the utmost care. It will crash your interpreter any day if you let it.
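You can verify the no-copy claim yourself (sizes here match the question, but any shapes work): the broadcast view has a zero stride on the repeated axis and points back at the original buffer.

```python
import numpy as np

mask = np.random.randint(2, size=(40, 45, 40)).astype(bool)
bmask = np.broadcast_to(mask[..., None], (40, 45, 40, 1000))

print(bmask.strides)           # last stride is 0: the repeated axis costs nothing
print(bmask.base is not None)  # True -- it is a view, not a copy
print(bmask.flags.writeable)   # False -- broadcast views are read-only
```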
Note: In a masked array, True is interpreted as masked, while in a Boolean indexing array, True is interpreted as kept. Depending on how the mask is obtained and its meaning in your real code, you may want to invert it:
mask=np.broadcast_to(~mask[..., None], ...)
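A tiny demonstration of the two conventions (made-up values): the same elements survive if you invert the mask when crossing from indexing to np.ma.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])
keep = np.array([True, True, False])  # boolean-indexing convention: True = keep

print(data[keep])                             # [1. 2.] -- elements kept by indexing
print(np.ma.array(data, mask=~keep).mean())   # 1.5 -- inverted mask keeps the same elements
```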
Another option is to implement the masking yourself. There are two ways you can do that. If you do it up-front, the mask will be applied to the leading dimensions of your data:
ts = np.random.rand(40, 45, 40, 1000)
mask = np.random.randint(2, size=(40, 45, 40), dtype=bool)
# creating a masked array
mask = ~mask  # optional, see note above
ts_m = ts[mask]
# demeaning
ts_md = ts_m - ts_m.mean(axis=-1, keepdims=True)
# standardisation
ts_mds = ts_md / ts_md.std(ddof=1, axis=-1, keepdims=True)
# reshaping
result = np.empty_like(ts)  # alternatively, np.zeros_like
result[mask] = ts_mds
This option may be cheaper than a masked array because the initial masking step produces an array of shape (mask.sum(), 1000), i.e. only the selected voxels, and replaces it into the masked area of the result only when finished, instead of operating on the full-sized data and preserving its shape throughout.
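As a quick sanity check (with small, made-up sizes so it runs fast), the up-front masking route and the masked-array route agree on the selected voxels:

```python
import numpy as np

rng = np.random.default_rng(0)
ts = rng.random((4, 5, 4, 100))
mask = rng.integers(2, size=(4, 5, 4)).astype(bool)  # True = keep, indexing convention

# Masked-array route (note the inversion: np.ma masks where True).
ts_ma = np.ma.array(ts, mask=np.broadcast_to(~mask[..., None], ts.shape))
ref = (ts_ma - np.ma.mean(ts_ma, axis=3)[..., None]) \
      / np.ma.std(ts_ma, ddof=1, axis=3)[..., None]

# Up-front masking route: work only on the selected voxels.
sel = ts[mask]
out = (sel - sel.mean(axis=-1, keepdims=True)) \
      / sel.std(ddof=1, axis=-1, keepdims=True)

print(np.ma.allclose(out, ref[mask]))  # True
```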
The third option is only really useful if you have only a small number of elements masked out. It's essentially what your original code is doing: perform all the computations, and apply the mask to the result.
More Tips
Ellipsis is a special object that means "all the remaining dimensions". In slice notation it is usually written as .... np.newaxis is an alias for None. Combine those pieces of information, and you get that [:, :, :, np.newaxis] can be written more cleanly and elegantly as [..., None]. The latter is also more general, since it works for an arbitrary number of dimensions.
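For example (arbitrary shape, just to show the equivalence):

```python
import numpy as np

a = np.zeros((2, 3, 4))
print(a[:, :, :, np.newaxis].shape)  # (2, 3, 4, 1)
print(a[..., None].shape)            # (2, 3, 4, 1) -- identical result
print(a[..., None, :].shape)         # (2, 3, 1, 4) -- ... adapts to where the new axis goes
```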
Numpy allows negative axis indices, so a nicer way to say "last axis" is generally axis=-1.
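A short illustration (example shape chosen arbitrarily):

```python
import numpy as np

a = np.zeros((3, 5))
print(a.mean(axis=-1).shape)          # (3,) -- last axis, here the same as axis=1
print(a.mean(axis=a.ndim - 1).shape)  # (3,) -- what axis=-1 saves you from writing
```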