Consider the following code:
import numpy as np

x = np.array([[1, 2, 3], ['NaN', 4, 'NaN'], [7, 8, 9]])
# Convert 'NaN' strings to masked values
mask = np.ma.masked_where(x == 'NaN', x)
# Get a boolean array indicating where the original array is not masked
bool_arr = ~mask.mask
# Filter the original array using the boolean array
filtered_arr = x[bool_arr]
print(filtered_arr)
The code above results in the following output:
['1' '2' '3' '4' '7' '8' '9']
However, I want my output to look as follows:
[['1' '2' '3'],
['4'],
['7' '8' '9']]
Where am I going wrong?
You create an array of strings:
In [22]: x
Out[22]:
array([['1', '2', '3'],
['NaN', '4', 'NaN'],
['7', '8', '9']], dtype='<U11')
and a masked array with the same strings:
In [23]: masked
Out[23]:
masked_array(
data=[['1', '2', '3'],
[--, '4', --],
['7', '8', '9']],
mask=[[False, False, False],
[ True, False, True],
[False, False, False]],
fill_value='N/A',
dtype='<U11')
Masked arrays have a method for returning the values:
In [24]: masked.compressed()
Out[24]: array(['1', '2', '3', '4', '7', '8', '9'], dtype='<U11')
In [25]: masked.compressed?
Signature: masked.compressed()
Docstring:
Return all the non-masked data as a 1-D array.
Any indexing with a boolean array returns a 1-d array. It can't return a 2-d array like the original, since in general each row may have a different number of elements. Specifically, in your case you want a list with elements of sizes 3, 1, 3. numpy whole-array methods don't produce that kind of thing.
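To see why a rectangular result is impossible here, you can count the surviving elements per row (a quick sketch, re-creating the string array from the question):

```python
import numpy as np

x = np.array([['1', '2', '3'], ['NaN', '4', 'NaN'], ['7', '8', '9']])
keep = x != 'NaN'
# Rows keep different numbers of elements, so x[keep]
# cannot be reshaped into a rectangular 2-d array:
print(keep.sum(axis=1))   # per-row counts: [3 1 3]
print(x[keep].shape)      # flattened result: (7,)
```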
To get 1d arrays by row, you have to work row by row:
In [30]: [row[row!='NaN'] for row in x]
Out[30]:
[array(['1', '2', '3'], dtype='<U11'),
array(['4'], dtype='<U11'),
array(['7', '8', '9'], dtype='<U11')]
Or maybe you want to remove the 'NaN' by column:
In [32]: [row[row!='NaN'] for row in x.T]
Out[32]:
[array(['1', '7'], dtype='<U11'),
array(['2', '4', '8'], dtype='<U11'),
array(['3', '9'], dtype='<U11')]
Probably not, but do you see the inherent ambiguity in your quest?
If you cast the array to float:
In [34]: x = x.astype(float)
In [35]: x
Out[35]:
array([[ 1., 2., 3.],
[nan, 4., nan],
[ 7., 8., 9.]])
then you can mask the invalid values with:
In [36]: np.ma.masked_invalid(x)
Out[36]:
masked_array(
data=[[1.0, 2.0, 3.0],
[--, 4.0, --],
[7.0, 8.0, 9.0]],
mask=[[False, False, False],
[ True, False, True],
[False, False, False]],
fill_value=1e+20)
You still have a flattening issue when it comes to extracting the non-masked values:
In [40]: np.ma.masked_invalid(x).compressed()
Out[40]: array([1., 2., 3., 4., 7., 8., 9.])
But there are a number of functions that let you work with the non-nan values of an array, such as taking the row-wise mean:
In [42]: np.nanmean(x,axis=1)
Out[42]: array([2., 4., 8.])
The list of lists (or arrays) that you desire loses most of the computational advantages that you normally get with a 2-d array.
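As a sketch of that trade-off (re-creating the float array with nan from above): the 2-d array supports a single vectorized call, while the ragged list of arrays forces a Python-level loop:

```python
import numpy as np

x = np.array([[1., 2., 3.], [np.nan, 4., np.nan], [7., 8., 9.]])
# On the 2-d array, one vectorized call handles all rows at once:
print(np.nanmean(x, axis=1))          # [2. 4. 8.]
# With a ragged list of arrays, each row must be processed separately:
rows = [row[~np.isnan(row)] for row in x]
print([r.mean() for r in rows])       # [2.0, 4.0, 8.0]
```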