I have following dataset:
ID DATE VALUE
1 01-01-2020 98
1 02-01-2020 96
1 03-01-2020 94
2 01-01-2020 99
2 02-01-2020 90
2 05-01-2020 85
For each id I want to calculate the date where the value will reach 0 through linear regression. I succeeded in plotting an extended line, where I can visually through following code see the point where it reaches 0, but I don't succeed in extracting the exact date.
For now my code looks like this:
# convert date to ordinal
df['DATE_ord']=df['DATE'].map(dt.datetime.toordinal)
# Create plot with current values
ax = sns.regplot(data=df,x='DATE_ord',y='VALUE')
# Set x-axis values (ordinal dates)
date_range = np.arange(730000, 760000, 1000)
# Calculate extended values
slope, intercept, r_value, p_value, std_err = stats.linregress(x = df[
"DATE_ord"].values, y = df["VALUE"].values)
line = [slope*xi + intercept for xi in date_range]
plt.plot(date_range, line, label="Fitting Line", linewidth=1)
# Convert x-axis values back to date
ax.set_xlabel('date')
new_labels = [date.fromordinal(int(item)) for item in ax.get_xticks()]
ax.set_xticklabels(new_labels)
I now need to find the date where the Y-value (VALUE) reaches 0. I found this post which more or less wants to achieve the same, but this doesn't work because the values in my slope are a different number as the predicted values for my x-axis, and I'm at a loss on how to make this work. The only thing I can think of is to calculate it with the regression expression and to fill in the numbers, but I'm not sure how to do this in a way that would work in a for-loop which I still need to create to loop through the different ID's.
Let's try to do this with pandas only, without looping
Load sample data
import pandas as pd
from io import StringIO
data = StringIO(
'''
ID DATE VALUE
1 01-01-2020 98
1 02-01-2020 96
1 03-01-2020 94
2 01-01-2020 99
2 02-01-2020 90
2 05-01-2020 85
''')
df = pd.read_csv(data, sep = '\s+')
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst = True)
generate ordinals and calculate regression coefficients per group:
import datetime as dt
from scipy import stats
df['DATE_ord']=df['DATE'].map(dt.datetime.toordinal)
dfr = (df.groupby('ID').apply(lambda g: stats.linregress(x = g[
"DATE_ord"].values, y = g["VALUE"].values)[0:2])
.apply(pd.Series)
.rename(columns = {0 : 'slope', 1:'intercept'}))
dfr
we get this
slope intercept
ID
1 -2.000000 1.474948e+06
2 -3.076923 2.269096e+06
Now the 'zero' date is just as a solution to 0 = s * x + i which is x0 = -intercept/slope. we calculate that in ordinals and convert back to a date
dfr['zd_ord'] = -dfr['intercept']/dfr['slope']
dfr['zd'] = dfr['zd_ord'].astype(int).map(dt.datetime.fromordinal)
to get
slope intercept zd_ord zd
ID
1 -2.000000 1.474948e+06 737474.00 2020-02-19
2 -3.076923 2.269096e+06 737456.35 2020-02-01