Tags: python, arrays, python-3.x, numpy, vectorization

Use numpy masked array on an array of arrays without getting a flattened output


Consider the following code:

import numpy as np

x = np.array([[1, 2, 3], ['NaN', 4, 'NaN'], [7, 8, 9]])

# Convert 'NaN' strings to masked values
mask = np.ma.masked_where(x == 'NaN', x)

# Get a boolean array indicating where the original array is not masked
bool_arr = ~mask.mask

# Filter the original array using the boolean array
filtered_arr = x[bool_arr]

print(filtered_arr)

The code above produces the following output:

['1' '2' '3' '4' '7' '8' '9']

However, I want my output to look as follows:

[['1' '2' '3'],
 ['4'],
 ['7' '8' '9']]

Where am I going wrong?


Solution

  • Mixing numbers with the string 'NaN' makes np.array create an array of strings:

    In [22]: x
    Out[22]: 
    array([['1', '2', '3'],
           ['NaN', '4', 'NaN'],
           ['7', '8', '9']], dtype='<U11')
    

    and a masked array (named masked here, mask in your code) with the same strings:

    In [23]: masked
    Out[23]: 
    masked_array(
      data=[['1', '2', '3'],
            [--, '4', --],
            ['7', '8', '9']],
      mask=[[False, False, False],
            [ True, False,  True],
            [False, False, False]],
      fill_value='N/A',
      dtype='<U11')
    

    Masked arrays have a method, compressed(), that returns the non-masked values:

    In [24]: masked.compressed()
    Out[24]: array(['1', '2', '3', '4', '7', '8', '9'], dtype='<U11')
    
    In [25]: masked.compressed?
    Signature: masked.compressed()
    Docstring:
    Return all the non-masked data as a 1-D array.
    

    Indexing a 2d array with a boolean array of the same shape always returns a 1d array. It can't return a 2d array like the original, since in general each row may keep a different number of elements. In your case you want a list whose elements have sizes 3, 1 and 3. NumPy's whole-array methods don't produce that kind of ragged result.
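
To make the shape argument concrete, here is a minimal sketch. It assumes the array is built directly from strings (rather than from mixed ints and strings as in the question), and the names keep, flat and row_lengths are only illustrative.

```python
import numpy as np

# Same data as in the question, written as strings up front.
x = np.array([['1', '2', '3'],
              ['NaN', '4', 'NaN'],
              ['7', '8', '9']])

# Boolean array with the same shape as x: True where the entry is kept.
keep = x != 'NaN'

# Indexing with a same-shape boolean array collapses the result to 1d.
flat = x[keep]

# Number of surviving entries in each row; the rows are not equally long.
row_lengths = keep.sum(axis=1)

print(keep.shape)    # (3, 3)
print(flat.shape)    # (7,)
print(row_lengths)   # [3 1 3]
```

Because the row lengths differ, there is no rectangular 2d shape that could hold the result, which is why the flat 1d array is the only thing boolean indexing can return here.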

    To get 1d arrays by row, you have to work row by row:

    In [30]: [row[row!='NaN'] for row in x]
    Out[30]: 
    [array(['1', '2', '3'], dtype='<U11'),
     array(['4'], dtype='<U11'),
     array(['7', '8', '9'], dtype='<U11')]
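
If you would rather start from the masked array, the same row-by-row idea can be sketched as follows. This is one possible approach under the assumption that the data is a string array; it uses np.ma.getmaskarray to get a full-shape boolean mask, and the names keep and rows are illustrative.

```python
import numpy as np

x = np.array([['1', '2', '3'],
              ['NaN', '4', 'NaN'],
              ['7', '8', '9']])

# Mask every 'NaN' string, as in the question.
masked = np.ma.masked_where(x == 'NaN', x)

# getmaskarray always returns a boolean array with the same shape as the data.
keep = ~np.ma.getmaskarray(masked)

# Filter each row with its own 1d boolean row, then convert to plain lists.
rows = [row[row_keep].tolist() for row, row_keep in zip(x, keep)]

print(rows)   # [['1', '2', '3'], ['4'], ['7', '8', '9']]
```

The result is an ordinary Python list of lists, not a NumPy array, which is what lets the rows have different lengths.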
    

    Or maybe you want to remove the 'NaN' entries by column:

    In [32]: [row[row!='NaN'] for row in x.T]
    Out[32]: 
    [array(['1', '7'], dtype='<U11'),
     array(['2', '4', '8'], dtype='<U11'),
     array(['3', '9'], dtype='<U11')]
    

    Probably not, but do you see the inherent ambiguity in your request?

    You can cast the array to float, which turns the 'NaN' strings into real nan values:

    In [34]: x = x.astype(float)
    In [35]: x
    Out[35]: 
    array([[ 1.,  2.,  3.],
           [nan,  4., nan],
           [ 7.,  8.,  9.]])
    

    You can then mask the nan values with:

    In [36]: np.ma.masked_invalid(x)
    Out[36]: 
    masked_array(
      data=[[1.0, 2.0, 3.0],
            [--, 4.0, --],
            [7.0, 8.0, 9.0]],
      mask=[[False, False, False],
            [ True, False,  True],
            [False, False, False]],
      fill_value=1e+20)
    

    You still have the flattening issue when it comes to extracting the non-masked values:

    In [40]: np.ma.masked_invalid(x).compressed()
    Out[40]: array([1., 2., 3., 4., 7., 8., 9.])
    

    But there are a number of functions that let you work with the non-nan values of an array, such as taking the row-wise mean:

    In [42]: np.nanmean(x,axis=1)
    Out[42]: array([2., 4., 8.])
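
A few more reductions in the same spirit, as a hedged sketch: np.nansum, np.nanmax and np.nanmean all skip nan entries, and the masked array's own mean gives the same row means. The variable names are illustrative.

```python
import numpy as np

# Float version of the array; the 'NaN' strings become real nan values.
x = np.array([['1', '2', '3'],
              ['NaN', '4', 'NaN'],
              ['7', '8', '9']]).astype(float)

# nan-aware reductions keep the 2d layout and simply ignore nan entries.
row_sum = np.nansum(x, axis=1)     # sum over each row
row_max = np.nanmax(x, axis=1)     # max over each row
col_mean = np.nanmean(x, axis=0)   # mean over each column

# The masked array gives the same row means without any flattening.
masked_row_mean = np.ma.masked_invalid(x).mean(axis=1)

print(row_sum)           # [ 6.  4. 24.]
print(row_max)           # [3. 4. 9.]
print(masked_row_mean)   # [2.0 4.0 8.0]
```

Both routes keep the data in a rectangular array, so the reductions stay vectorized.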
    

    The list of lists (or arrays) that you want loses most of the computational advantages that you normally get with a 2d array.