python numpy missing-data astropy model-fitting

How to fit 2-D function if some of the data points are NaNs?

I am trying to fit a 2-D surface to a data. More specifically, I want to find a function which maps pixel coordinate to wavelength coordinate, just as FITCOORDS in IRAF does.

As an example, I want to find the fit to test array in the following snippet:

import numpy as np
from astropy.modeling.models import Chebyshev2D
from astropy.modeling.fitting import LevMarLSQFitter
#%%
test = np.array([[7473, 7040, 6613, 6183, 5753, 5321, 4888],
   [7474, 7042, 6616, 6186, np.nan, 5325, 4893],
   [7476, 7044, 6619, 6189, 5759, 5328, 4897],
   [7479, 7047, np.nan, 6192, 5762, 5331, 4900]])
grid_pix, grid_wave = np.mgrid[:4, :7]
fitter = LevMarLSQFitter()
c2_init = Chebyshev2D(x_degree=3, y_degree=3)
c2_fit = fitter(c2_init, grid_wave, grid_pix, test)
print(c2_fit)

ResultI with astropy 2.0.2 and numpy 1.13.3 on Python 3.6:

Model: Chebyshev2D
Inputs: ('x', 'y')
Outputs: ('z',)
Model set size: 1
X-Degree: 3
Y-Degree: 3
Parameters:
    c0_0 c1_0 c2_0 c3_0 c0_1 c1_1 c2_1 c3_1 c0_2 c1_2 c2_2 c3_2 c0_3 c1_3 c2_3 c3_3
    ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
WARNING: Model is linear in parameters; consider using linear fitting methods. [astropy.modeling.fitting]

It's evident that the fitting has never worked.

If I change the np.nan to certain values, the fit works as expected (e.g., manually change np.nan's to 0, 1, etc).

How should I get a reasonable result? Can I let the fitter to ignore np.nan values?

Solution

You can simply remove the nans in your data and the corresponding "indices" of your grid. For example with boolean indexing:

notnans = np.isfinite(test)  # array containing True for finite values and False for nans/infs
c2_fit = fitter(c2_init, grid_wave[notnans], grid_pix[notnans], test[notnans])
print(c2_fit)

It still prints a Warning about linear in parameters but the values are definitely nonzero:

Model: Chebyshev2D
Inputs: ('x', 'y')
Outputs: ('z',)
Model set size: 1
X-Degree: 3
Y-Degree: 3
Parameters:
         c0_0          c1_0           c2_0      ...       c2_3            c3_3      
    ------------- -------------- -------------- ... --------------- ----------------
    7473.01546325 -431.633443323 0.471726190475 ... 0.0229037267082 -0.0012077294686

The trick here is that x, y and your data don't need to be 2D arrays, they can be 1D arrays (as returned by boolean indexing) as long as they "represent" a 2D grid.

Just in case you have "big regions" containing NaNs this approach might not be good enough because the fitter could fit anything there. If that's the case you could interpolate over these areas using astropy.convolution.convolve and then replacing the NaNs of your data with the result of convolve:

from astropy.convolution import convolve
# Just for illustration I used a 5x5 mean filter here, the size must be adjusted
# depending on the size of all-nan-regions in your data.
mean_convolved = convolve(test, np.ones((5, 5)), boundary='extend')
# Replacing NaNs in test with the mean_convolved values:
nan_mask = np.isnan(test)
test[nan_mask] = mean_convolved[nan_mask]

# Now pass test to the fitter:
c2_fit = fitter(c2_init, grid_wave, grid_pix, test)

However for a few sparsely distributed NaNs a convolution isn't necessary. You probably need to compare the two approaches and see which one gives more "trustworthy" results. Missing values can be a real problem with fitting.