Search code examples
pythonpandaslinear-regression

Find x-value where y = 0 for extended regression


I have following dataset:

ID        DATE            VALUE
1         01-01-2020      98
1         02-01-2020      96
1         03-01-2020      94
2         01-01-2020      99
2         02-01-2020      90
2         05-01-2020      85

For each id I want to calculate the date where the value will reach 0 through linear regression. I succeeded in plotting an extended line, where I can visually through following code see the point where it reaches 0, but I don't succeed in extracting the exact date.

For now my code looks like this:

# convert date to ordinal 
df['DATE_ord']=df['DATE'].map(dt.datetime.toordinal)

# Create plot with current values
ax = sns.regplot(data=df,x='DATE_ord',y='VALUE')

# Set x-axis values (ordinal dates)
date_range = np.arange(730000, 760000, 1000)

# Calculate extended values
slope, intercept, r_value, p_value, std_err = stats.linregress(x = df[
        "DATE_ord"].values, y = df["VALUE"].values)

line = [slope*xi + intercept for xi in date_range]

plt.plot(date_range, line, label="Fitting Line", linewidth=1)

# Convert x-axis values back to date
ax.set_xlabel('date')
new_labels = [date.fromordinal(int(item)) for item in ax.get_xticks()]
ax.set_xticklabels(new_labels)

I now need to find the date where the Y-value (VALUE) reaches 0. I found this post which more or less wants to achieve the same, but this doesn't work because the values in my slope are a different number as the predicted values for my x-axis, and I'm at a loss on how to make this work. The only thing I can think of is to calculate it with the regression expression and to fill in the numbers, but I'm not sure how to do this in a way that would work in a for-loop which I still need to create to loop through the different ID's.


Solution

  • Let's try to do this with pandas only, without looping

    Load sample data

    import pandas as pd
    from io import StringIO
    data = StringIO(
    '''
    ID        DATE            VALUE
    1         01-01-2020      98
    1         02-01-2020      96
    1         03-01-2020      94
    2         01-01-2020      99
    2         02-01-2020      90
    2         05-01-2020      85
    ''')
    df = pd.read_csv(data, sep = '\s+')
    df['DATE'] = pd.to_datetime(df['DATE'], dayfirst = True)
    

    generate ordinals and calculate regression coefficients per group:

    import datetime as dt
    from scipy import stats
    df['DATE_ord']=df['DATE'].map(dt.datetime.toordinal)
    dfr = (df.groupby('ID').apply(lambda g: stats.linregress(x = g[
            "DATE_ord"].values, y = g["VALUE"].values)[0:2])
      .apply(pd.Series)
      .rename(columns = {0 : 'slope', 1:'intercept'}))
    dfr
    

    we get this

    
         slope      intercept
    ID      
    1   -2.000000   1.474948e+06
    2   -3.076923   2.269096e+06
    

    Now the 'zero' date is just as a solution to 0 = s * x + i which is x0 = -intercept/slope. we calculate that in ordinals and convert back to a date

    dfr['zd_ord'] = -dfr['intercept']/dfr['slope']
    dfr['zd'] = dfr['zd_ord'].astype(int).map(dt.datetime.fromordinal)
    

    to get

    
         slope      intercept       zd_ord      zd
    ID              
    1   -2.000000   1.474948e+06    737474.00   2020-02-19
    2   -3.076923   2.269096e+06    737456.35   2020-02-01