
Binning data into equal box sizes and applying OLS to each bin


I have a DataFrame df1:

import pandas as pd
import numpy as np
import statsmodels.formula.api as sm

df1 = pd.DataFrame(np.random.randn(3000, 1), index=pd.date_range('1/1/1990', periods=3000), columns=["M"])

I would like to group the elements into boxes of size 10, fit each box using OLS, and compute Y_t, where Y_t stands for the series of straight-line fits.

In other words, I would like to take the first 10 values, fit them using OLS (Y_t = b*X_t + a_0), and obtain the fitted values Y_t for those 10 values. Then I would do the same for the next 10 values (not a rolling window!), and so on.

My approach

The first issue I faced was that I could not fit the elements using DateTime values as predictors, so I defined a new DataFrame df_fit that contains two columns, A and B. Column B contains the integers from 1 to 10 (the predictor), and column A contains the values of df1 in groups of 10 elements:

def compute_yt(df, i, bs):

    df_fit = pd.DataFrame({"B": np.arange(1, bs + 1),
                           "A": df.reset_index().loc[i*bs:((i+1)*bs-1), "M"]})

    fit = sm.ols(formula="A ~ B", data=df_fit).fit()
    yt = fit.params.B*df_fit["B"] + fit.params.Intercept

    return yt

Where bs is the box size (10 in this example) and i is an index that sweeps over all the boxes.

Finally,

 bs = 10
 result = [compute_yt(df1, n, bs) for n in range(len(df1) // bs)]

 result =    
      Name: B, dtype: float64, 840   -0.249590
      841   -0.249935
      842   -0.250280
      843   -0.250625
      844   -0.250970
      845   -0.251315
      846   -0.251660
      847   -0.252005
      848   -0.252350
      849   -0.252695
      Name: B, dtype: float64, 850   -0.252631
      851   -0.252408
      ...    ... 

Where result is a list that should contain the values of the straight-line fits.

So, my questions are the following:

  1. Is there a way to run an OLS using DateTime values as predictors?

  2. I would like to use the list comprehension to build a DataFrame (with the same shape as df1) containing the values of y_t. This relates to question (1) in the sense that I would like to obtain a time-series for these values.

  3. Is there a more "pythonic" way to write this code? The way I have sliced the DataFrame does not seem very suitable.


Solution

  • Not really sure if this is what you wanted to do, but I first added a group number and an observation number to each row of your DataFrame and then pivoted it so that every row holds the 10 observations of one group.

    df1 = pd.DataFrame( data={'M':np.random.randn(3000)}, index= pd.date_range('1/1/1990', periods=3000))
    
    df1['group_num'] = np.repeat(range(300), 10)
    df1['obs_num'] = np.tile(range(10), 300)
    
    df_pivot = df1.pivot(index='group_num', columns='obs_num')
    print(df_pivot.head())
    

    Output

                      M                                                    \
    obs_num           0         1         2         3         4         5   
    group_num                                                               
    0         -0.063775 -1.293410  0.395011 -1.224491  1.777335 -2.395643   
    1         -1.111679  1.668670  1.864227 -1.555251  0.959276  0.615344   
    2         -0.213891 -0.733493  0.175590  0.561410  1.359565 -1.341193   
    3          0.534735 -2.154626 -1.226191 -0.309502  1.368085  0.769155   
    4         -0.611289 -0.545276 -1.924381  0.383596  0.322731  0.989450   
    
    
    obs_num           6         7         8         9  
    group_num                                          
    0         -1.461194 -0.481617 -1.101098  1.102030  
    1         -0.120995 -1.046757  1.286074 -0.832990  
    2          0.322485 -0.825315 -2.277746 -0.619008  
    3          0.794694  0.912190 -1.006603  0.572619  
    4         -1.191902  1.229913  1.105221  0.899331 
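
The group and observation numbers above are hard-coded for 3000 rows and a box size of 10. As a minimal sketch, assuming a box size variable bs (a name borrowed from the question) and a stand-in frame df, the same labels can be derived from each row's position with integer division and remainder:

```python
import numpy as np
import pandas as pd

bs = 10  # box size (assumed name, taken from the question)
df = pd.DataFrame({"M": np.random.randn(3000)},
                  index=pd.date_range("1/1/1990", periods=3000))

pos = np.arange(len(df))      # 0, 1, 2, ... position of each row
df["group_num"] = pos // bs   # which box the row falls into
df["obs_num"] = pos % bs      # position of the row inside its box
```

If len(df) is not a multiple of bs, the last box simply ends up with fewer rows, so a pivot on these labels would leave missing values in its last row.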
    

    I then wrote a function that runs ordinary least squares with statsmodels, using the array interface rather than the formula interface.

    import statsmodels.api as sm
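    def compute_yt(row):
        x = np.arange(len(row))          # predictor: position within the group (0..9)
        X = sm.add_constant(x)           # prepend the intercept column
        fit = sm.OLS(row.values, X).fit()
        # evaluate the fitted line at x, not at the observed values
        yt = fit.params[1] * x + fit.params[0]
        # return a Series so that apply keeps the column layout
        return pd.Series(yt, index=row.index)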
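
A straight-line fit is evaluated at the predictor (the position 0..9 inside the box), never at the observed values themselves. Here is a small sketch of that for a single box, using np.polyfit instead of statsmodels to keep it short; the variable names and the seeded random data are only illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(10)    # one box of 10 observations
x = np.arange(10)              # predictor: position within the box

# Least-squares line of degree 1; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
y_fit = slope * x + intercept  # fitted line evaluated at x

# With an intercept in the model, the OLS residuals sum to (numerically) zero.
residual_sum = (y - y_fit).sum()
```

The fitted values lie on a line, so consecutive entries of y_fit differ by exactly slope.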
    

    I then called this function over all the rows via apply.

    df_pivot.apply(compute_yt, axis=1)
    

    This gives a fitted value for each of the original observations, 10 per group.
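
On questions (1) and (2): one way to use the dates as the predictor and get back a time series with the same index as df1 is to convert the timestamps to elapsed days inside each box and fit with groupby plus transform. This is only a sketch under a few assumptions: it swaps statsmodels for np.polyfit, the helper name fit_line and the variables bins and y_t are made up for illustration, and it assumes len(df1) is a multiple of bs so that every box has enough points for a line.

```python
import numpy as np
import pandas as pd

bs = 10  # box size (assumed name, taken from the question)
df1 = pd.DataFrame({"M": np.random.randn(3000)},
                   index=pd.date_range("1/1/1990", periods=3000))

def fit_line(s):
    """Fit a straight line to one box and return the fitted values."""
    # Dates become numbers: days elapsed since the first date in the box.
    x = np.asarray((s.index - s.index[0]).days, dtype=float)
    slope, intercept = np.polyfit(x, s.to_numpy(), 1)
    return pd.Series(slope * x + intercept, index=s.index)

# Non-overlapping boxes: rows 0-9 -> 0, rows 10-19 -> 1, ...
bins = np.arange(len(df1)) // bs

# transform returns a Series aligned with df1's DatetimeIndex.
y_t = df1["M"].groupby(bins).transform(fit_line)
```

Since y_t carries the original dates, it can be stored next to the data with df1["Y_t"] = y_t. Rescaling the predictor (days instead of positions 1..10) changes the slope's units but not the fitted values.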