import numpy as np
ts = np.random.rand(40,45,40,1000)
mask = np.random.randint(2, size=(40,45,40),dtype=bool)
#creating a masked array
ts_m = np.ma.array(ts, mask=ts*~mask[:,:,:,np.newaxis])
#demeaning
ts_md = ts_m - ts_m.mean(axis=3)[:,:,:,np.newaxis]
#standardisation
ts_mds = ts_md / ts_md.std(ddof=1,axis=3)[:,:,:,np.newaxis]
I would like to demean ts (along axis 3), and divide by its standard deviation (along axis 3), all within the mask.
Am I doing this correctly?
Is there a faster method?
You have a few options available to you.
The first is to use masked arrays as you are doing, but provide a proper mask and use the masked functions. Right now, your code computes all the means and standard deviations and then slaps a mask on the result. To skip masked elements, use np.ma.mean and np.ma.std, and thereby avoid doing a whole lot of extra work.
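A minimal sketch of the difference (toy values, just to illustrate): np.ma.mean skips masked entries, while the plain mean of the underlying data does not.

```python
import numpy as np

# One masked outlier: the masked mean ignores it, the raw mean does not.
a = np.ma.array([1.0, 2.0, 100.0], mask=[False, False, True])
print(np.ma.mean(a))    # 1.5 -- only the unmasked 1.0 and 2.0 count
print(np.mean(a.data))  # ~34.33 -- the raw buffer still contains 100.0
```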
As you correctly understood, the size of the mask must match that of the data. Multiplying by the data does give you the correct size, but it is expensive and gives the wrong result in the general case, since the product is zero wherever either the data or the mask is zero. A better approach is to create a view of the mask repeated along the last (new) dimension. You can use np.broadcast_to if you first make the trailing dimensions match up:
ts = np.random.rand(40, 45, 40, 1000)
mask = np.random.randint(2, size=(40, 45, 40), dtype=bool)
# creating a masked array
ts_m = np.ma.array(ts, mask=np.broadcast_to(mask[..., None], ts.shape))
# demeaning
ts_md = ts_m - np.ma.mean(ts_m, axis=3)[..., None]
# standardisation
ts_mds = ts_md / np.ma.std(ts_m, ddof=1, axis=3)[..., None]
The broadcast mask is read-only, and because it has a dimension with zero stride, it can sometimes do unexpected things. The broadcast version here is roughly equivalent to
np.lib.stride_tricks.as_strided(mask, ts.shape, (*mask.strides, 0), writeable=False)
Both versions create views of the original data, so they are very fast: they only allocate a new array object that points to the existing buffer, which is not copied. Keep in mind that np.lib.stride_tricks.as_strided
is a sledgehammer that should be used with the utmost care. It will crash your interpreter any day if you let it.
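You can verify the no-copy claim yourself (sizes here match the question, but any shapes work): the broadcast view has a zero stride on the repeated axis and points back at the original buffer.

```python
import numpy as np

mask = np.random.randint(2, size=(40, 45, 40)).astype(bool)
bmask = np.broadcast_to(mask[..., None], (40, 45, 40, 1000))

print(bmask.strides)           # last stride is 0: the repeated axis costs nothing
print(bmask.base is not None)  # True -- it is a view, not a copy
print(bmask.flags.writeable)   # False -- broadcast views are read-only
```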
Note: In a masked array, True is interpreted as masked, while in a Boolean indexing array, True is interpreted as kept. Depending on how the mask is obtained and its meaning in your real code, you may want to invert it:
mask=np.broadcast_to(~mask[..., None], ...)
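A tiny demonstration of the two conventions (made-up values): the same elements survive if you invert the mask when crossing from indexing to np.ma.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])
keep = np.array([True, True, False])  # boolean-indexing convention: True = keep

print(data[keep])                             # [1. 2.] -- elements kept by indexing
print(np.ma.array(data, mask=~keep).mean())   # 1.5 -- inverted mask keeps the same elements
```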
Another option is to implement the masking yourself. There are two ways you can do that. If you do it up-front, the mask will be applied to the leading dimensions of your data:
ts = np.random.rand(40, 45, 40, 1000)
mask = np.random.randint(2, size=(40, 45, 40), dtype=bool)
# creating a masked array
mask = ~mask  # optional, see note above
ts_m = ts[mask]
# demeaning
ts_md = ts_m - ts_m.mean(axis=-1, keepdims=True)
# standardisation
ts_mds = ts_md / ts_md.std(ddof=1, axis=-1, keepdims=True)
# reshaping
result = np.empty_like(ts)  # alternatively, np.zeros_like
result[mask] = ts_mds
This option may be cheaper than a masked array because the initial masking step produces an array of shape (mask.sum(), 1000), i.e. only the selected voxels, and replaces it into the masked area of the result only when finished, instead of operating on the full-sized data and preserving its shape throughout.
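As a quick sanity check (with small, made-up sizes so it runs fast), the up-front masking route and the masked-array route agree on the selected voxels:

```python
import numpy as np

rng = np.random.default_rng(0)
ts = rng.random((4, 5, 4, 100))
mask = rng.integers(2, size=(4, 5, 4)).astype(bool)  # True = keep, indexing convention

# Masked-array route (note the inversion: np.ma masks where True).
ts_ma = np.ma.array(ts, mask=np.broadcast_to(~mask[..., None], ts.shape))
ref = (ts_ma - np.ma.mean(ts_ma, axis=3)[..., None]) \
      / np.ma.std(ts_ma, ddof=1, axis=3)[..., None]

# Up-front masking route: work only on the selected voxels.
sel = ts[mask]
out = (sel - sel.mean(axis=-1, keepdims=True)) \
      / sel.std(ddof=1, axis=-1, keepdims=True)

print(np.ma.allclose(out, ref[mask]))  # True
```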
The third option is only really useful if you have only a small number of elements masked out. It's essentially what your original code is doing: perform all the computations, and apply the mask to the result.
More Tips
Ellipsis is a special object that means "all the remaining dimensions". In slice notation it is usually written as .... np.newaxis is an alias for None. Combine those pieces of information, and you get that [:, :, :, np.newaxis] can be written more cleanly and elegantly as [..., None]. The latter is also more general, since it works for an arbitrary number of dimensions.
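For example (arbitrary shape, just to show the equivalence):

```python
import numpy as np

a = np.zeros((2, 3, 4))
print(a[:, :, :, np.newaxis].shape)  # (2, 3, 4, 1)
print(a[..., None].shape)            # (2, 3, 4, 1) -- identical result
print(a[..., None, :].shape)         # (2, 3, 1, 4) -- ... adapts to where the new axis goes
```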
Numpy allows negative axis indices, so a nicer way to say "last axis" is generally axis=-1.
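A short illustration (example shape chosen arbitrarily):

```python
import numpy as np

a = np.zeros((3, 5))
print(a.mean(axis=-1).shape)          # (3,) -- last axis, here the same as axis=1
print(a.mean(axis=a.ndim - 1).shape)  # (3,) -- what axis=-1 saves you from writing
```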