pandas, interpolation, linear-interpolation

Pandas interpolation type when method='index'?


The pandas documentation indicates that when method='index', the numerical values of the index are used. However, I haven't found any indication of the underlying interpolation method employed. It looks like it uses linear interpolation. Can anyone confirm this definitively or point me to where this is stated in the documentation?


Solution

  • It turns out the documentation is a bit misleading. Readers are likely to interpret

    ‘index’, ‘values’: use the actual numerical values of the index.

    as "fill the NaN values with the numerical values of the index", which is not correct. It should be read as "interpolate linearly, using the actual numerical values of the index as the x-coordinates".

    In the source code of pandas.DataFrame.interpolate, the difference between method='linear' and method='index' comes down mainly to the following code:

    if method == "linear":
        # prior default
        index = np.arange(len(obj.index))
        index = Index(index)
    else:
        index = obj.index
    

    So if the DataFrame uses the default RangeIndex, method='linear' and method='index' give the same result. If you specify a different index, the results can differ, as the following example shows:

    import pandas as pd
    import numpy as np
    
    d = {'val': [1, np.nan, 3]}
    df0 = pd.DataFrame(d)
    df1 = pd.DataFrame(d, [0, 1, 6])
    
    print("df0:\nmethod_index:\n{}\nmethod_linear:\n{}\n".format(df0.interpolate(method='index'), df0.interpolate(method='linear')))
    print("df1:\nmethod_index:\n{}\nmethod_linear:\n{}\n".format(df1.interpolate(method='index'), df1.interpolate(method='linear')))
    

    Outputs:

    df0:
    method_index:
       val
    0  1.0
    1  2.0
    2  3.0
    method_linear:
       val
    0  1.0
    1  2.0
    2  3.0
    
    df1:
    method_index:
       val
    0  1.000000
    1  1.333333
    6  3.000000
    method_linear:
       val
    0  1.0
    1  2.0
    6  3.0
    

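    As you can see, with index=[0, 1, 6] and val=[1.0, NaN, 3.0], method='index' fills the gap at index 1 with 1.0 + (3.0 - 1.0) * (1 - 0) / (6 - 0) = 1.333333, whereas method='linear' treats the rows as equally spaced and gives 2.0.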
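
To make the arithmetic explicit, here is a minimal sketch of the two-point linear interpolation formula applied to the df1 example. The helper name interpolate_at is hypothetical and is not part of pandas; it only illustrates the calculation.

```python
def interpolate_at(x, x0, y0, x1, y1):
    """Linearly interpolate y at position x between (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# df1 has index [0, 1, 6] and values [1.0, NaN, 3.0]; the gap is the middle row.

# method='index' uses the index labels 0, 1, 6 as x-coordinates.
by_index = interpolate_at(1, 0, 1.0, 6, 3.0)

# method='linear' ignores the labels and uses row positions 0, 1, 2 instead.
by_position = interpolate_at(1, 0, 1.0, 2, 3.0)

print(round(by_index, 6), by_position)
```

The first value matches the method='index' output for df1 and the second matches the method='linear' output.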

    Following the call path through the pandas source (generic.py -> managers.py -> blocks.py -> missing.py), we find the implementation. For these methods it calls np.interp, NumPy's one-dimensional piecewise linear interpolation, with the actual numerical values of the index as the x-coordinates:

    NP_METHODS = ["linear", "time", "index", "values"]
    
    if method in NP_METHODS:
        # np.interp requires sorted X values, #21037
        indexer = np.argsort(inds[valid])
        result[invalid] = np.interp(
            inds[invalid], inds[valid][indexer], yvalues[valid][indexer]
        )
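
The excerpt can be tried outside pandas with plain NumPy. This is a rough sketch, not the pandas implementation: the arrays inds, yvalues, valid, invalid and result are hypothetical stand-ins built by hand, and the index is deliberately unsorted to show why the argsort step is there.

```python
import numpy as np

# Hypothetical stand-ins for the arrays used in the excerpt above.
inds = np.array([6.0, 0.0, 1.0])        # index values, deliberately unsorted
yvalues = np.array([3.0, 1.0, np.nan])  # column values with one gap

invalid = np.isnan(yvalues)  # positions to fill
valid = ~invalid             # positions with known values

result = yvalues.copy()

# np.interp needs increasing x-coordinates, so sort the known points first.
indexer = np.argsort(inds[valid])
result[invalid] = np.interp(
    inds[invalid], inds[valid][indexer], yvalues[valid][indexer]
)

print(result)
```

The gap at index value 1 is filled by linear interpolation between the points (0, 1.0) and (6, 3.0), the same calculation that method='index' performs for df1.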