
What's the easiest way to calculate the regression coefficient in Python?


I have a 100 by 1 response variable Y and a 100 by 3 predictor matrix X. I want to calculate the regression coefficient (X'X)^{-1}X'Y.

Currently I'm doing it as follows:

import numpy as np
from numpy.linalg import inv

invXpX = inv(np.dot(np.transpose(X), X))  # (X'X)^{-1}
XpY = np.dot(np.transpose(X), Y)          # X'Y
betahat = np.dot(invXpX, XpY)             # (X'X)^{-1} X'Y

This looks pretty cumbersome, whereas in MATLAB it can be written just like the original math formula: inv(X'*X)*X'*Y. Is there an easier way to calculate this regression coefficient in Python? Thanks!


Solution

  • Yes, it can be written more compactly, but note that this will not always improve your code or its readability.

    The transpose of a numpy array is available through the .T attribute. If you use numpy matrix instead of numpy arrays you can also use .I for the inverse, but I would recommend sticking with ndarray. For matrix multiplication you can use the @ operator, so np.dot(X, Y), X.dot(Y), and X @ Y are all equivalent when X and Y are 2-D numpy arrays.
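
    As a quick, minimal illustration of that equivalence (the small arrays here are made up just for the demo):

    import numpy as np

    A = np.arange(6).reshape(3, 2)   # 3x2 array
    B = np.arange(8).reshape(2, 4)   # 2x4 array

    # All three spellings compute the same matrix product
    assert np.allclose(np.dot(A, B), A.dot(B))
    assert np.allclose(A.dot(B), A @ B)

    print(A.T.shape)  # (2, 3) -- .T gives the transpose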

    import numpy as np
    # Simulate data from a quadratic model y = a*x**2 + b*x + c
    a, b, c = 1, 2, 3
    x = np.arange(100)
    # Add random component to y values for estimation
    y = a*x**2 + b*x + c + np.random.randn(100)
    # Get X matrix [100x3]
    X = np.vstack([x**2, x, np.ones(x.shape)]).T
    
    # Estimate coefficients a, b, c
    x_hat = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    >>> array([0.99998334, 2.00246583, 2.95697339])
    x_hat = np.linalg.inv(X.T @ X) @ X.T @ y
    >>> array([0.99998334, 2.00246583, 2.95697339])
    
    # Using np.matrix (works, but ndarray is generally preferred):
    X_mat = np.matrix(X)
    x_hat = (X_mat.T @ X_mat).I @ X_mat.T @ y
    >>> matrix([[0.99998334, 2.00246583, 2.95697339]])
    
    # without noise:
    y = a*x**2 + b*x + c 
    x_hat = (X_mat.T @ X_mat).I @ X_mat.T @ y
    >>> matrix([[1., 2., 3.]])
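
    One more note: explicitly inverting X.T @ X can be numerically unstable when X is ill-conditioned, and np.linalg.lstsq solves the same least-squares problem via a more stable factorization. A minimal sketch reusing X and the noiseless y from above:

    # Numerically safer alternative: solve the least-squares problem
    # directly instead of forming the explicit inverse of X'X.
    x_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
    # x_hat recovers [1., 2., 3.] (up to floating-point error)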