Consider the following code:
import numpy as np

x = np.array([[1, 2, 3], ['NaN', 4, 'NaN'], [7, 8, 9]])
# Convert 'NaN' strings to masked values
mask = np.ma.masked_where(x == 'NaN', x)
# Get a boolean array indicating where the original array is not masked
bool_arr = ~mask.mask
# Filter the original array using the boolean array
filtered_arr = x[bool_arr]
print(filtered_arr)
The code above results in the following output:
['1' '2' '3' '4' '7' '8' '9']
However, I want my output to look as follows:
[['1' '2' '3'],
['4'],
['7' '8' '9']]
Where am I going wrong?
You create an array of strings:
In [22]: x
Out[22]:
array([['1', '2', '3'],
['NaN', '4', 'NaN'],
['7', '8', '9']], dtype='<U11')
and a masked array with the same strings:
In [23]: masked
Out[23]:
masked_array(
data=[['1', '2', '3'],
[--, '4', --],
['7', '8', '9']],
mask=[[False, False, False],
[ True, False, True],
[False, False, False]],
fill_value='N/A',
dtype='<U11')
Masked arrays have a method for returning the values:
In [24]: masked.compressed()
Out[24]: array(['1', '2', '3', '4', '7', '8', '9'], dtype='<U11')
In [25]: masked.compressed?
Signature: masked.compressed()
Docstring:
Return all the non-masked data as a 1-D array.
Any indexing with a boolean array returns a 1-d array. It can't return a 2-d array like the original, since in general each row may have a different number of elements. Specifically, in your case you want a list with elements of sizes 3, 1, 3. numpy whole-array methods don't produce that kind of thing.
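To see why a rectangular result is impossible here, you can count the surviving elements per row (a quick sketch, re-creating the string array from the question):

```python
import numpy as np

x = np.array([['1', '2', '3'], ['NaN', '4', 'NaN'], ['7', '8', '9']])
keep = x != 'NaN'
# Rows keep different numbers of elements, so x[keep]
# cannot be reshaped into a rectangular 2-d array:
print(keep.sum(axis=1))   # per-row counts: [3 1 3]
print(x[keep].shape)      # flattened result: (7,)
```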
To get 1d arrays by row, you have to work row by row:
In [30]: [row[row!='NaN'] for row in x]
Out[30]:
[array(['1', '2', '3'], dtype='<U11'),
array(['4'], dtype='<U11'),
array(['7', '8', '9'], dtype='<U11')]
Or maybe you want to remove the 'NaN' by column:
In [32]: [row[row!='NaN'] for row in x.T]
Out[32]:
[array(['1', '7'], dtype='<U11'),
array(['2', '4', '8'], dtype='<U11'),
array(['3', '9'], dtype='<U11')]
Probably not, but do you see the inherent ambiguity in your quest?
If you cast the array to float:
In [34]: x = x.astype(float)
In [35]: x
Out[35]:
array([[ 1., 2., 3.],
[nan, 4., nan],
[ 7., 8., 9.]])
then you can mask the invalid values with:
In [36]: np.ma.masked_invalid(x)
Out[36]:
masked_array(
data=[[1.0, 2.0, 3.0],
[--, 4.0, --],
[7.0, 8.0, 9.0]],
mask=[[False, False, False],
[ True, False, True],
[False, False, False]],
fill_value=1e+20)
You still have a flattening issue when it comes to extracting the non-masked values:
In [40]: np.ma.masked_invalid(x).compressed()
Out[40]: array([1., 2., 3., 4., 7., 8., 9.])
But there are a number of functions that let you work with the non-nan values of an array, such as taking the row-wise mean:
In [42]: np.nanmean(x,axis=1)
Out[42]: array([2., 4., 8.])
The list of lists (or arrays) that you desire loses most of the computational advantages that you normally get with a 2-d array.
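As a sketch of that trade-off (re-creating the float array with nan from above): the 2-d array supports a single vectorized call, while the ragged list of arrays forces a Python-level loop:

```python
import numpy as np

x = np.array([[1., 2., 3.], [np.nan, 4., np.nan], [7., 8., 9.]])
# On the 2-d array, one vectorized call handles all rows at once:
print(np.nanmean(x, axis=1))          # [2. 4. 8.]
# With a ragged list of arrays, each row must be processed separately:
rows = [row[~np.isnan(row)] for row in x]
print([r.mean() for r in rows])       # [2.0, 4.0, 8.0]
```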