Tags: python, pandas, list, for-loop, group-by

Find Future Value for Unique Column Entries in Dataframe using Linear Regression


In the "Example Dataframe" below there are three unique labels in the "Label" column: "A", "B" and "C".

I would like to predict the future value of "Value1" for "A", "B" and "C" when "Value2" is 65000000 using linear regression.

Example Dataframe

import pandas as pd
data = {'Label': ['A','A','A','A','A','A','B','B','B','B','B','B','C','C','C','C','C','C'],
        'Value1': ['1672964520','1672966620','1672967460','1672969380','1672971840',
                   '1672972200','1672963800','1672966140', '1672967760','1672969020',
                   '1672970520', '1672971360','1672963200','1672964700','1672966260',
                   '1672967820', '1672969980', '1672971180'],
        'Value2': ['54727520', '54729380', '54740070', '54744720', '54775410', '54779130',
                   '59598560','59603190','59605060','59611320','59628900','59630950',
                   '58047810','58049680','58051550','58058460','58068740','58088280']}
df = pd.DataFrame(data)
print(df)

I can predict the future value of "Value1" when "A" is the only label in the dataframe (see below).

However, I am stuck on applying this approach to the full example dataframe. Is there an easy way to modify this code to predict "Value1" for every label found in the example dataframe?

Desired output for "Value1": predicted value of A = "X", B = "Y", C = "Z", etc.

import numpy as np
import pandas as pd

data = {'Label': ['A','A','A','A','A','A'],
        'Value1': ['1672964520','1672966620','1672967460','1672969380','1672971840', '1672972200'],
        'Value2': ['54727520', '54729380', '54740070', '54744720', '54775410', '54779130']}

# Create dataframe using data
df = pd.DataFrame(data)
# Convert Value1 and Value2 from object (string) to int64
df["Value1"] = df["Value1"].astype("int64")
df["Value2"] = df["Value2"].astype("int64")
# Calc means for x (Value1) and y (Value2) respectively
xmean = np.mean(df["Value1"])
ymean = np.mean(df["Value2"])
# Calc numerator and denominator of beta
df["xyCov"] = (df["Value1"] - xmean) * (df["Value2"] - ymean)
df["xVar"] = (df["Value1"] - xmean) ** 2
# Calc beta (slope) and alpha (intercept) of the line Value2 = alpha + beta * Value1
beta = df["xyCov"].sum() / df["xVar"].sum()
alpha = ymean - (beta * xmean)
# Invert the fitted line to get the Value1 (timestamp) at which Value2 reaches 65000000
Predicted_Value1 = (65000000 - alpha) / beta
print("Future A value", Predicted_Value1)
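
For reference, the code above is an ordinary least-squares fit of "Value2" (y) on "Value1" (x), followed by inverting the fitted line at the target y* = 65000000. The symbols x, y and y* are introduced here only for illustration; β and α correspond to `beta` and `alpha` in the code:

```latex
\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
\alpha = \bar{y} - \beta\,\bar{x},
\qquad
\hat{x} = \frac{y^{*} - \alpha}{\beta}
```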

Solution

  • Here is one way to do it with your example dataframe, using Pandas groupby and Python f-strings:

    import numpy as np

    for label, df_ in df.groupby("Label"):
        # Work on a copy so the new columns are not assigned to a slice of df
        df_ = df_.copy()

        # Convert Value1 and Value2 from object (string) to int64
        df_["Value1"] = df_["Value1"].astype("int64")
        df_["Value2"] = df_["Value2"].astype("int64")

        # Calc means for x (Value1) and y (Value2) respectively
        xmean = np.mean(df_["Value1"])
        ymean = np.mean(df_["Value2"])

        # Calc numerator and denominator of beta (the variance term uses x, i.e. Value1)
        df_["xyCov"] = (df_["Value1"] - xmean) * (df_["Value2"] - ymean)
        df_["xVar"] = (df_["Value1"] - xmean) ** 2

        # Calc beta (slope) and alpha (intercept)
        beta = df_["xyCov"].sum() / df_["xVar"].sum()
        alpha = ymean - (beta * xmean)

        # Invert the fitted line to get the Value1 (timestamp) at which Value2 reaches 65000000
        Predicted_Value1 = (65000000 - alpha) / beta

        print(f"Future {label} value", Predicted_Value1)

    Which outputs (values shown here rounded to five significant figures):

    Future A value 1.6744e+09
    Future B value 1.6742e+09
    Future C value 1.6745e+09
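
    The per-label fit can also be wrapped in a small helper so that it is reusable outside the loop. The sketch below is one possible variant rather than the code from the answer: the function name `predict_value1` and its `target` parameter are hypothetical, `np.polyfit` stands in for the hand-written covariance and variance sums, and the datetime conversion assumes "Value1" is a Unix timestamp in seconds (the question's comments call it a timestamp but do not state the unit).

```python
import numpy as np
import pandas as pd

data = {'Label': ['A'] * 6 + ['B'] * 6 + ['C'] * 6,
        'Value1': ['1672964520', '1672966620', '1672967460', '1672969380', '1672971840',
                   '1672972200', '1672963800', '1672966140', '1672967760', '1672969020',
                   '1672970520', '1672971360', '1672963200', '1672964700', '1672966260',
                   '1672967820', '1672969980', '1672971180'],
        'Value2': ['54727520', '54729380', '54740070', '54744720', '54775410', '54779130',
                   '59598560', '59603190', '59605060', '59611320', '59628900', '59630950',
                   '58047810', '58049680', '58051550', '58058460', '58068740', '58088280']}

# Convert both value columns from strings to integers once, before grouping
df = pd.DataFrame(data).astype({"Value1": "int64", "Value2": "int64"})


def predict_value1(group, target=65_000_000):
    """Fit Value2 = alpha + beta * Value1 for one label, then solve for Value1 at `target`."""
    x_mean = group["Value1"].mean()
    # Centre x on its mean so the fit does not work with values around 1.67e9
    x = group["Value1"] - x_mean
    y = group["Value2"]
    beta, alpha = np.polyfit(x, y, deg=1)  # coefficients come back as [slope, intercept]
    # Undo the centring when inverting the fitted line
    return x_mean + (target - alpha) / beta


predictions = {label: predict_value1(group) for label, group in df.groupby("Label")}

for label, value in predictions.items():
    # Assumption: Value1 is a Unix timestamp in seconds
    print(f"Future {label} value {value:.0f} ({pd.to_datetime(value, unit='s')})")
```

    Centring "Value1" before fitting is a design choice, not a requirement: it leaves the slope unchanged and only keeps the numbers passed to the fit small.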