Search code examples
pythonregressionkalman-filterpykalman

Multiple regression with pykalman?


I'm looking for a way to generalize regression using pykalman from 1 to N regressors. We will not bother about online regression initially - I just want a toy example to set up the Kalman filter for 2 regressors instead of 1, i.e. Y = c1 * x1 + c2 * x2 + const.

For the single regressor case, the following code works. My question is how to change the filter setup so it works for two regressors:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from pykalman import KalmanFilter

    if __name__ == "__main__":
        file_name = '<path>\KalmanExample.txt'
        df = pd.read_csv(file_name, index_col = 0)
        prices = df[['ETF', 'ASSET_1']] #, 'ASSET_2']]
    
        delta = 1e-5
        trans_cov = delta / (1 - delta) * np.eye(2)
        obs_mat = np.vstack( [prices['ETF'], 
                            np.ones(prices['ETF'].shape)]).T[:, np.newaxis]
    
        kf = KalmanFilter(
            n_dim_obs=1,
            n_dim_state=2,
            initial_state_mean=np.zeros(2),
            initial_state_covariance=np.ones((2, 2)),
            transition_matrices=np.eye(2),
            observation_matrices=obs_mat,
            observation_covariance=1.0,
            transition_covariance=trans_cov
        )
    
        state_means, state_covs = kf.filter(prices['ASSET_1'].values)
    
        # Draw slope and intercept...
        pd.DataFrame(
            dict(
                slope=state_means[:, 0],
                intercept=state_means[:, 1]
            ), index=prices.index
        ).plot(subplots=True)
        plt.show()

The example file KalmanExample.txt contains the following data:

Date,ETF,ASSET_1,ASSET_2
2007-01-02,176.5,136.5,141.0
2007-01-03,169.5,115.5,143.25
2007-01-04,160.5,111.75,143.5
2007-01-05,160.5,112.25,143.25
2007-01-08,161.0,112.0,142.5
2007-01-09,155.5,110.5,141.25
2007-01-10,156.5,112.75,141.25
2007-01-11,162.0,118.5,142.75
2007-01-12,161.5,117.0,142.5
2007-01-15,160.0,118.75,146.75
2007-01-16,156.5,119.5,146.75
2007-01-17,155.0,120.5,145.75
2007-01-18,154.5,124.5,144.0
2007-01-19,155.5,126.0,142.75
2007-01-22,157.5,124.5,142.5
2007-01-23,161.5,124.25,141.75
2007-01-24,164.5,125.25,142.75
2007-01-25,164.0,126.5,143.0
2007-01-26,161.5,128.5,143.0
2007-01-29,161.5,128.5,140.0
2007-01-30,161.5,129.75,139.25
2007-01-31,161.5,131.5,137.5
2007-02-01,164.0,130.0,137.0
2007-02-02,156.5,132.0,128.75
2007-02-05,156.0,131.5,132.0
2007-02-06,159.0,131.25,130.25
2007-02-07,159.5,136.25,131.5
2007-02-08,153.5,136.0,129.5
2007-02-09,154.5,138.75,128.5
2007-02-12,151.0,136.75,126.0
2007-02-13,151.5,139.5,126.75
2007-02-14,155.0,169.0,129.75
2007-02-15,153.0,169.5,129.75
2007-02-16,149.75,166.5,128.0
2007-02-19,150.0,168.5,130.0

The single regressor case provides the following output and for the two-regressor case I want a second "slope"-plot representing C2.

enter image description here


Solution

  • Answer edited to reflect my revised understanding of the question.

    If I understand correctly you wish to model an observable output variable Y = ETF, as a linear combination of two observable values; ASSET_1, ASSET_2.

    The coefficients of this regression are to be treated as the system states, i.e. ETF = x1*ASSET_1 + x2*ASSET_2 + x3, where x1 and x2 are the coefficients assets 1 and 2 respectively, and x3 is the intercept. These coefficients are assumed to evolve slowly.

    Code implementing this is given below, note that this is just extending the existing example to have one more regressor.

    Note also that you can get quite different results by playing with the delta parameter. If this is set large (far from zero), then the coefficients will change more rapidly, and the reconstruction of the regressand will be near-perfect. If it is set small (very close to zero) then the coefficients will evolve more slowly and the reconstruction of the regressand will be less perfect. You might want to look into the Expectation Maximisation algorithm - supported by pykalman.

    CODE:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from pykalman import KalmanFilter
    
    if __name__ == "__main__":
        file_name = 'KalmanExample.txt'
        df = pd.read_csv(file_name, index_col = 0)
        prices = df[['ETF', 'ASSET_1', 'ASSET_2']]
        delta = 1e-3
        trans_cov = delta / (1 - delta) * np.eye(3)
        obs_mat = np.vstack( [prices['ASSET_1'], prices['ASSET_2'],  
                              np.ones(prices['ASSET_1'].shape)]).T[:, np.newaxis]
        kf = KalmanFilter(
            n_dim_obs=1,
            n_dim_state=3,
            initial_state_mean=np.zeros(3),
            initial_state_covariance=np.ones((3, 3)),
            transition_matrices=np.eye(3),
            observation_matrices=obs_mat,
            observation_covariance=1.0,
            transition_covariance=trans_cov        
        )
    
        # state_means, state_covs = kf.em(prices['ETF'].values).smooth(prices['ETF'].values)
        state_means, state_covs = kf.filter(prices['ETF'].values)
    
    
        # Re-construct ETF from coefficients and 'ASSET_1' and ASSET_2 values:
        ETF_est = np.array([a.dot(b) for a, b in zip(np.squeeze(obs_mat), state_means)])
    
        # Draw slope and intercept...
        pd.DataFrame(
            dict(
                slope1=state_means[:, 0],
                slope2=state_means[:, 1],
                intercept=state_means[:, 2],
            ), index=prices.index
        ).plot(subplots=True)
        plt.show()
    
        # Draw actual y, and estimated y:
        pd.DataFrame(
            dict(
                ETF_est=ETF_est,
                ETF_act=prices['ETF'].values
            ), index=prices.index
        ).plot()
        plt.show()
    

    PLOTS:

    States evolving

    Reconstruction of regressand