Tags: python, scikit-learn, linear-regression, multicollinearity

How to simulate multi-collinearity using Sklearn?


I want to see what effect multi-collinearity has on a linear regression model, but to do that I need to be able to generate multi-collinear data where I can vary both the number of features and the degree of collinearity between them.

I've had a look at Sklearn's make_regression function. It allows multiple features to be generated, but from what I understand these features are all uncorrelated. Is that correct?
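
For reference, a quick way to check this, assuming sklearn.datasets.make_regression with its default settings, would be to inspect the correlation matrix of the generated features:

    import numpy as np
    from sklearn.datasets import make_regression

    # Generate features with sklearn's built-in make_regression and inspect
    # the pairwise correlations: the off-diagonal entries come out close to
    # zero, i.e. the features are (approximately) uncorrelated.
    X, y = make_regression(n_samples=1000, n_features=4, noise=1.0, random_state=0)
    print(np.round(np.corrcoef(X, rowvar=False), 1))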

If so, does anyone know how I could vary the correlation between these features, or suggest a different method for generating a multi-collinear dataset to train Sklearn's linear regression model on?


Solution

  • You could simulate the features from a multivariate normal distribution whose covariance matrix has 1 on the diagonal and the desired correlation on every off-diagonal entry, as follows:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    
    def make_regression(n_samples, n_uncorrelated, n_correlated, correlation, weights, bias, noise=1, seed=42):
    
        np.random.seed(seed)
    
        # Correlated block: equicorrelated multivariate normal, i.e. a covariance
        # matrix with 1 on the diagonal and `correlation` on every off-diagonal entry.
        X_correlated = np.random.multivariate_normal(
            mean=np.zeros(n_correlated),
            cov=correlation * np.ones((n_correlated, n_correlated)) + (1 - correlation) * np.eye(n_correlated),
            size=n_samples
        )
    
        # Uncorrelated block: identity covariance, i.e. independent standard normals.
        X_uncorrelated = np.random.multivariate_normal(
            mean=np.zeros(n_uncorrelated),
            cov=np.eye(n_uncorrelated),
            size=n_samples
        )
    
        # Linear target: y = bias + X @ weights + Gaussian noise.
        X = np.hstack([X_correlated, X_uncorrelated])
        e = np.random.normal(loc=0, scale=noise, size=n_samples)
        y = bias + np.dot(X, weights) + e
    
        return X, y
    
    X, y = make_regression(
        n_samples=1000,
        n_uncorrelated=1,
        n_correlated=3,
        correlation=0.999,
        weights=[0.5, 0.5, 0.5, 0.5],
        bias=0,
    )
    
    print(np.round(np.corrcoef(X, rowvar=False), 1))
    # [[ 1.  1.  1. -0.]
    #  [ 1.  1.  1. -0.]
    #  [ 1.  1.  1. -0.]
    #  [-0. -0. -0.  1.]]
    
    reg = LinearRegression()
    reg.fit(X, y)
    
    print(reg.intercept_)
    # -0.0503434375710194
    
    print(reg.coef_)
    # [0.62245063 -0.43110213  1.31516103  0.52019845]
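
    The fitted coefficients above are far from the true weights of 0.5 (while the intercept is close to the true bias of 0); that instability of the individual coefficients is the multi-collinearity effect. A quick way to see it (a sketch, reusing the make_regression helper defined above) is to regenerate the data with different seeds and compare the fitted coefficients:

    # With correlation=0.999 the three correlated columns are nearly identical,
    # so OLS can split their combined true weight of 1.5 (= 3 * 0.5) among them
    # almost arbitrarily: the individual coefficients swing from seed to seed,
    # while their sum stays close to 1.5.
    for seed in (0, 1, 2):
        X, y = make_regression(
            n_samples=1000,
            n_uncorrelated=1,
            n_correlated=3,
            correlation=0.999,
            weights=[0.5, 0.5, 0.5, 0.5],
            bias=0,
            seed=seed,
        )
        coef = LinearRegression().fit(X, y).coef_
        print(np.round(coef, 2), "| sum of correlated coefs:", np.round(coef[:3].sum(), 2))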