python arrays ols multiplelinearregression

Linear Regression and Generating Data 𝐲=𝐗𝛽+𝜖


I have been given a problem in a Jupyter notebook to code using Python. The problem is about linear regression and reads as follows:

1: Linear Regression

In this notebook we will generate data from a linear function, 𝐲=𝐗𝛽+𝜖, and then solve for 𝛽̂ using OLS (ordinary least squares) and gradient descent.

Question 1.1: Generate data: 𝐲=𝐗𝛽+𝜖

Here we assume 𝑦≈𝑔(𝑋,𝛽)=𝐗𝛽+𝜖, where 𝑔 is linear in 𝛽 with additive noise 𝜖. Your function should have the following properties:

  • output y as an np.array with shape (M,1)
  • generate_linear_y should work for any arbitrary x, b, and eps, as long as they are the appropriate dimensions
  • do not use for-loops to calculate each y[i] separately, as this will be very slow for large M and N. Instead, you should leverage numpy linear algebra.


They expect us to write code as follows:

def generate_linear_y(X,b):
    """ Write a function that generates M data points from inputs X and b

    Parameters
    ----------
    X :   numpy.ndarray
          X.shape must be (M,N)
          Each row of `X` is a single data point of dimension N
          Therefore `X` represents M data points

    b :   numpy.ndarray
          b.shape must be (N,1)
          Each element of `b` is a value of beta such that b=[[b1][b2]...[bN]]

    Returns
    -------
    y :   numpy.ndarray
          y.shape = (M,1)
          y[i] = X[i]b
    """

Can someone please assist me? I am thoroughly confused! I didn't realize this would require array programming in Python, which I always struggle with. Please help!


Solution

  • This looks like a direct matrix multiplication to me. In NumPy, this is implemented using the matrix multiplication operator @ (aka np.matmul).

    To generate random noise, you can use the functions from numpy.random, most likely random_sample (uniform on [0, 1)) or standard_normal (mean 0, variance 1). The currently recommended approach is to create a random number generator with default_rng, then call its methods, for instance rng.standard_normal.
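Putting the two suggestions together, a minimal sketch might look like the following. The optional eps argument is an assumption: the template's signature takes only X and b, while the problem text also mentions eps, so it is added here as a keyword that defaults to no noise. The sizes, seed, and noise scale in the usage lines are arbitrary illustrative values, not part of the assignment.

```python
import numpy as np


def generate_linear_y(X, b, eps=None):
    """Return y = X @ b (+ eps) with shape (M, 1).

    `eps` is a hypothetical optional argument; the original template
    only lists X and b.
    """
    X = np.asarray(X, dtype=float)
    b = np.asarray(b, dtype=float)
    # (M, N) @ (N, 1) -> (M, 1); one matrix product, no Python loop
    y = X @ b
    if eps is not None:
        # Accept eps as (M, 1) or (M,) and align it with y
        y = y + np.asarray(eps, dtype=float).reshape(y.shape)
    return y


# Usage with made-up sizes and a seeded generator for reproducibility
rng = np.random.default_rng(seed=0)
M, N = 5, 3
X = rng.standard_normal((M, N))
b = np.array([[1.0], [2.0], [3.0]])
eps = 0.1 * rng.standard_normal((M, 1))  # small Gaussian noise

y = generate_linear_y(X, b, eps)
print(y.shape)  # (5, 1)
```

Passing the noise in as an argument, rather than drawing it inside the function, keeps generate_linear_y deterministic and easy to check against X @ b.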