Search code examples
pythonpandasscikit-learncomparison

Comparing correlation errors of predictions different Silo's


I'm predicting the duration of a drying process in different silos. The support vector regression is retrained for a every silo, because they are more different than expected.

I want to check if there is a correlation between the 4 different silos, but I have a hard time getting all predictions into one dataframe.

The dataframe:

        df = pd.DataFrame({'date': ['2019-09-06', '2019-09-07', '2019-09-08', '2019-09-09', '2019-09-10'] , 'real_duration': [400, 420, 450,500, 380], 'silo_nr': [1, 2, 2, 3, 4], 'temp_outside' : [4, 5, 2, 6 , 15], 'zomer_winter_dummy': ["winter","winter","winter", "winter", "zommer"]})
        df.set_index('date', inplace= True)

My current code:

creation dataframe where i want to store all predictions:

df_predict[['real_duration', 'silo_nr']] = df[['real_duration', 'silo_nr']] 

                                
         

Not optimal, but current best plan is to repeat this code 4 times for all 4 silos:

X = df_test[df_test.silo == '1'][['moist', 'temp_outside', 'zomer_winter_dummy']]
X = pd.get_dummies(X, columns=['zomer_winter_dummy'], drop_first=True)
y = df_test[df_test.silo == '1']['real_duration']
model = SVR(kernel='poly', C= 1, epsilon = 1, gamma = 0.0001)
df_predict[df_predict.silo_nr =='1']['predict_kiln1'] = cross_val_predict(model, X, y, cv=5)

But I get the error:

ValueError: Length of values does not match length of index. 

I was trying to avoid this error by only selecting those rows of silo_nr == '1' to avoid this. Can you please suggest an improvement on my code? Thanks a lot !


Solution

  • Seems like you should be storing the resultant dataframes in a list or dictionary, then concatenating them together to create your output.

    Sample data:

    import pandas as pd
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import cross_val_predict
    
    df = pd.DataFrame({
        'date': pd.date_range('2019-09-06', '2019-10-15'),
        'real_duration': np.random.randint(100, 500, size=40), 
        'silo_nr': np.repeat([1,2,3,4], 10), 
        'temp_outside' : np.random.randint(5, 20, size=40), 
        'zomer_winter_dummy': np.random.choice(["winter", "zommer"], size=40)
    })
    
    df = df.set_index('date')
    
    print(df.head())
    
                real_duration  silo_nr  temp_outside zomer_winter_dummy
    date                                                               
    2019-09-06            165        1            14             zommer
    2019-09-07            370        1            19             zommer
    2019-09-08            484        1             6             zommer
    2019-09-09            358        1            16             zommer
    2019-09-10            244        1             7             winter
    

    groupby "silo_nr" and apply the model fit to each groups data. Store the resultant arrays into a dictionary where the key is the silo_nr value and the value is a Series of the predicted array. The series takes on the index of the input data to align for later use. Then use pd.concat on that dictionary to join all of the Series together into a single dataframe.

    model = SVR(kernel='poly', C= 1, epsilon = 1, gamma = 0.0001)
    
    results = {}
    for silo, group in df.groupby('silo_nr'):
        # whats the point of subsetting if you overwrite X on the next line?
        # X = group[['moist', 'temp_outside', 'zomer_winter_dummy']] 
        X = pd.get_dummies(group, columns=['zomer_winter_dummy'], drop_first=True)
        y = group['real_duration']
    
        results[silo] = pd.Series(cross_val_predict(model, X, y, cv=5), index=group.index, name="prediction")
    
    result_df = pd.concat(results, names=["silo_nr"]).reset_index(level=0)
    
    print(result_df.head(10)
                silo_nr  prediction
    date                           
    2019-09-06        1  237.048891
    2019-09-07        1  348.963419
    2019-09-08        1  539.037233
    2019-09-09        1  338.392746
    2019-09-10        1  259.937936
    2019-09-11        1  359.046132
    2019-09-12        1  432.049543
    2019-09-13        1  573.457120
    2019-09-14        1  279.745391
    2019-09-15        1  292.807550
    

    Lastly, if you want to join your predictions back with the original data, you can use the join function since the index of both dataframes is the same (just need to drop one of the "silo_nr" columns since its present in both dataframes first).

    final_df = result_df.drop("silo_nr", axis=1).join(df)
    
    print(final_df.head(10))
                prediction  real_duration  silo_nr  temp_outside zomer_winter_dummy
    date                                                                           
    2019-09-06  237.048891            165        1            14             zommer
    2019-09-07  348.963419            370        1            19             zommer
    2019-09-08  539.037233            484        1             6             zommer
    2019-09-09  338.392746            358        1            16             zommer
    2019-09-10  259.937936            244        1             7             winter
    2019-09-11  359.046132            382        1            11             zommer
    2019-09-12  432.049543            414        1            17             zommer
    2019-09-13  573.457120            487        1            17             winter
    2019-09-14  279.745391            237        1            10             winter
    2019-09-15  292.807550            274        1             5             winter
    

    To calculate the difference between predicted and observed results, you can calculate that on final_df as you have done. However in order to pivot the results appropriately we're going to need to replace the index. In my data, none of the samples were measured on the same day meaning that "date" is not a particularly good index for these data (if you have a measurement for each silo on each date, then I would use "date" as the index).

    final_df = (final_df
                .assign(
                    prediction_difference=lambda df: df["real_duration"] - df["prediction"],
                    measurement_count=lambda df: df.groupby("silo_nr").cumcount()
                )
                .pivot(index="measurement_count", columns="silo_nr", values=["real_duration", "prediction"])
               )
    
    print(final_df)
                      real_duration                       prediction                                    
    silo_nr                       1      2      3      4           1           2           3           4
    measurement_count                                                                                   
    0                         497.0  161.0  178.0  174.0  552.684521  238.446608  209.933437  231.625109
    1                         377.0  486.0  166.0  227.0  373.993094  496.397113  206.773769  246.807474
    2                         429.0  490.0  166.0  312.0  419.621826  487.391489  174.880098  286.195194
    3                         176.0  196.0  232.0  388.0  268.697640  247.515843  205.233186  368.080966
    4                         377.0  435.0  414.0  437.0  368.084024  413.005861  432.138127  442.188875
    5                         470.0  177.0  485.0  417.0  484.663314  243.078444  602.527207  410.407151
    6                         325.0  354.0  349.0  458.0  331.416194  293.509960  310.791084  483.367988
    7                         318.0  499.0  236.0  235.0  326.878359  507.602317  198.263842  247.611772
    8                         181.0  357.0  100.0  266.0  265.770711  300.769953  162.219113  255.572195
    9                         352.0  280.0  393.0  264.0  345.185827  233.542568  376.882980  248.664156