python pandas dataframe nan

Loss of precision when interpolating between valid points


My test.csv contains many NaNs:

"Time","Y1","Y2","Y3"
"s","celsius","celsius","celsius"
"0.193","","",""
"0.697","","1",""
"1.074","","","-27"
"1.579","10","",""
"2.083","","5",""
"3.123","15","","-28"
"5.003","","",""

When I try to fill the missing data between valid points using interpolate, the gaps are filled with whole numbers instead of fractional values:

import pandas as pd
df = pd.read_csv("test.csv")
df.loc[1:, "Y3"] = pd.to_numeric(df.loc[1:, "Y3"])
df.loc[1:, "Y3"] =  df.loc[1:, "Y3"].interpolate(method='linear').ffill()  #method='time' , method='index'

>>> print (df)
    Time       Y1       Y2       Y3
0      s  celsius  celsius  celsius
1  0.193      NaN      NaN      NaN
2  0.697      NaN        1      NaN
3  1.074      NaN      NaN      -27
4  1.579       10      NaN      -27  <<-----
5  2.083      NaN        5      -27  <<-----
6  3.123       15      NaN      -28
7  5.003      NaN      NaN      -28

I can fix the NaNs at the start of the columns with bfill, but how can I fill the points between -27 and -28 with fractional values like -27.3, -27.6?


Solution

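  • The problem is the units row ("s", "celsius", ...): read_csv treats it as data, so every column contains strings and ends up with object dtype.

    df.loc[1:, "Y3"] = pd.to_numeric(df.loc[1:, "Y3"]) converts the values but assigns them back into the same object column, so the column's dtype does not become numeric. That matches the output in the question: the gaps hold -27, the value that ffill copies forward, rather than interpolated values.

    Don't keep the units as a data row. Read both header lines as a MultiIndex instead: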

    df = pd.read_csv("test.csv", header=[0, 1])
    

    Then:

    df['Y3'] = df['Y3'].interpolate(method='linear').ffill()
    

    Output:

        Time      Y1      Y2         Y3
           s celsius celsius    celsius
    0  0.193     NaN     NaN        NaN
    1  0.697     NaN     1.0        NaN
    2  1.074     NaN     NaN -27.000000
    3  1.579    10.0     NaN -27.333333
    4  2.083     NaN     5.0 -27.666667
    5  3.123    15.0     NaN -28.000000
    6  5.003     NaN     NaN -28.000000
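
For reference, the steps above can be put together in one self-contained sketch. It inlines the CSV from the question with io.StringIO so it runs without a file. The names CSV_TEXT, y3, by_time and flat are only illustrative. The method='index' variant and the skiprows alternative are assumptions about what may be wanted (the question mentions method='index' in a comment); they are not part of the answer above.

```python
import io

import pandas as pd

# Same data as test.csv in the question, inlined so the example is self-contained.
CSV_TEXT = """\
"Time","Y1","Y2","Y3"
"s","celsius","celsius","celsius"
"0.193","","",""
"0.697","","1",""
"1.074","","","-27"
"1.579","10","",""
"2.083","","5",""
"3.123","15","","-28"
"5.003","","",""
"""

# Read both header lines; columns become a MultiIndex such as ("Y3", "celsius").
df = pd.read_csv(io.StringIO(CSV_TEXT), header=[0, 1])
y3 = ("Y3", "celsius")
print(df.dtypes)  # check that the columns are numeric, not object

# method="linear" treats the rows as equally spaced. bfill then fills the
# leading NaNs, as mentioned in the question.
df[y3] = df[y3].interpolate(method="linear").bfill()
print(df[y3])

# Assumed variant: weight the interpolation by the actual Time values.
# method="index" uses the numeric index, so Time is moved into the index first.
raw = pd.read_csv(io.StringIO(CSV_TEXT), header=[0, 1])
by_time = pd.Series(raw[y3].to_numpy(), index=raw[("Time", "s")].to_numpy())
by_time = by_time.interpolate(method="index").bfill()
print(by_time)

# Assumed alternative if the units are not needed: skip the units line
# (file line 1, counting from 0) and keep ordinary single-level columns.
flat = pd.read_csv(io.StringIO(CSV_TEXT), skiprows=[1])
print(flat.dtypes)
```

With method='linear' the two gaps are split into equal thirds regardless of how far apart the timestamps are. With method='index' the filled values depend on where each timestamp falls between 1.074 and 3.123, which may be closer to what is wanted for unevenly sampled data.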