python pandas dataframe linear-regression

add rows in pandas with values from a linear regression based on other rows


I have a dataframe with two columns, Date_of_Journey and Price. Date_of_Journey takes values between 1 and 119, but the dataframe has only 37 rows, so many dates are missing.

Is there a simple way to add the missing dates, each with a price somewhere between the prices of the previous and next rows?

Here is a plot of the data to give you an idea. For example, I would like to add rows for Date_of_Journey 4 and 5, each with a Price that fits the gray curve.


Solution

  • You could reindex your DataFrame to the full range of days using pd.RangeIndex and fill the gaps between the known values using DataFrame.interpolate(method='linear'). With more data you'll get a plot similar to yours.

    import pandas as pd
    import io
    
    data = """Date_of_Journey   Price
    1   24089.333333
    3   14873.397727
    6   14035.232877
    9   13178.641509
    15  5785.500000"""
    
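    # sep=r'\s+' accepts tabs or runs of spaces between the columns
    df = pd.read_csv(io.StringIO(data), sep=r'\s+', index_col='Date_of_Journey')
    # stop is exclusive, so stop=120 creates rows for days 1 through 119
    df = df.reindex(pd.RangeIndex(start=1, stop=120, step=1))
    # rows after the last known day keep the last known price
    df.interpolate(method='linear', inplace=True)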
    
    df.plot(y='Price')
    

    Output: Plot based on 5 datapoints
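
    To check the filled-in rows without plotting, here is a minimal sketch using the same five sample rows. It assumes you want Date_of_Journey back as an ordinary column, and that days after the last known value should stay empty rather than repeat the last price. That second assumption is why limit_area='inside' is passed to interpolate. The variable names full, filled and result are only illustrative.

```python
import io

import pandas as pd

# Same five sample rows as in the answer; sep=r"\s+" accepts tabs or spaces.
data = """Date_of_Journey Price
1 24089.333333
3 14873.397727
6 14035.232877
9 13178.641509
15 5785.500000"""

df = pd.read_csv(io.StringIO(data), sep=r"\s+", index_col="Date_of_Journey")

# One row per day from 1 to 119 inclusive (stop is exclusive).
full = df.reindex(pd.RangeIndex(start=1, stop=120))

# limit_area="inside" fills only the gaps between known points, so days
# after the last known value (15 in this sample) remain NaN.
filled = full.interpolate(method="linear", limit_area="inside")

# Turn the index back into a regular Date_of_Journey column.
result = filled.rename_axis("Date_of_Journey").reset_index()

# Days 4 and 5 now lie on the straight line between days 3 and 6.
print(result.head(6))
```

    Dropping limit_area gives the behaviour of the answer's code, where the rows after the last known day are filled with the last known price.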