numpy statistics normal-distribution

Confused by random.randn()


I am a bit confused by the NumPy function random.randn(), which returns random values drawn from the standard normal distribution in an array of whatever shape you choose.

My question is: when would this ever be useful in practice?

For reference, I am a complete programming noob, but I studied math (mostly stats-related courses) as an undergraduate.


Solution

  • The NumPy function randn is very useful for adding random noise to a synthetic dataset that you create for initial testing of a machine learning model. Say, for example, that you want a million-point dataset that is roughly linear for testing a regression algorithm. With NumPy imported as np and pandas imported as pd, you create a million evenly spaced x values using

    x_data = np.linspace(0.0, 10.0, 1000000)

    You then generate a million noise values, each drawn from the standard normal distribution (mean 0, standard deviation 1), using randn

    noise = np.random.randn(len(x_data))

    To create the linear dataset, follow the formula y = mx + b + noise with the following code (setting m = 0.5 and b = 5 in this example)

    y_data = (0.5 * x_data) + 5 + noise

    Finally, the dataset is assembled into a pandas DataFrame with

    my_data = pd.DataFrame({'X Data': x_data, 'Y': y_data})
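
To see why this is handy, here is a minimal end-to-end sketch of the steps above, followed by a quick check that a simple fit recovers the slope and intercept that went in. The fixed seed, the smaller sample size (n_points), and the use of np.polyfit as the "regression algorithm" are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Assumptions for illustration: a fixed seed so the run is repeatable,
# and 100,000 points instead of a million to keep it quick.
np.random.seed(0)
n_points = 100_000

# Evenly spaced x values between 0 and 10.
x_data = np.linspace(0.0, 10.0, n_points)

# One standard-normal noise value (mean 0, std 1) per x value.
noise = np.random.randn(len(x_data))

# y = m*x + b + noise, with m = 0.5 and b = 5.
y_data = 0.5 * x_data + 5 + noise

my_data = pd.DataFrame({'X Data': x_data, 'Y': y_data})

# Stand-in for a regression algorithm: a degree-1 least-squares fit.
# np.polyfit returns coefficients from highest degree down: [slope, intercept].
slope, intercept = np.polyfit(
    my_data['X Data'].to_numpy(), my_data['Y'].to_numpy(), 1
)
print(f"slope ~ {slope:.3f}, intercept ~ {intercept:.3f}")
```

Because you chose m and b yourself, you know what the fit should return, so any large deviation points to a bug in the model or training code rather than in the data.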