python dataframe numpy scipy normal-distribution

Calculate weight using normal distribution in Python


I have to add a weight column to the Titanic dataset, generating adult passengers' weights from a normal distribution with mean = 70 kg and std = 20 kg. I have tried this code:

df['Weight'] = np.random.normal(20, 70, size=891)
df['Weight'].fillna(df['Weight'].iloc[0], inplace=True)

but I am concerned about two things:

  1. It generates negative values as well as positive ones. How can these be considered valid weights? Is there anything I can change in the code to generate only positive values?
  2. I am targeting the adult age group, but what about children? Some rows also get implausible values, such as 7 kg for an adult or 30 kg for a child. How can this be solved? I appreciate any help you can provide.

Edit:

This code worked for me:

Weight = np.random.normal(80, 20, 718)
adults['Weight'] = Weight

Now I have to calculate the probability that a person weighs less than 70 kg, and the probability that a person weighs between 70 and 100 kg.

I have tried the following code, but it raises an error: TypeError: unsupported operand type(s) for -: 'str' and 'int'.

import pandas as pd
import numpy as np
import scipy.stats

adults = df[(df['Age'] >= 20) & (df['Age'] <= 70)]

Weight = np.random.normal(80, 20, 718)
adults['Weight'] = Weight

p1 = adults['Weight'] < 70
p2 = adults[(adults['Weight'] > 70) & (adults['Weight'] < 100)]

scipy.stats.norm.pdf(p1)
scipy.stats.norm.pdf(p2)

Solution

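    1. The range of a normal distribution is not restricted; it spans all real numbers. If you want to restrict it, you have to do so manually (for example, by clamping values to bounds you choose) or use another distribution.

      # np.random.normal takes (mean, std) in that order: mean = 70, std = 20
      df['Weight'] = np.random.normal(70, 20, size=891)
      # min_value and max_value are bounds you pick yourself
      df.loc[df['Weight'] < min_value, 'Weight'] = min_value
      df.loc[df['Weight'] > max_value, 'Weight'] = max_value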
      
    2. Since the weights of children and adults are not identically distributed, you should sample them from different distributions:

      # use a different (mean, std) pair per group; each size must match the number of rows selected
      df.loc[df['person_type'] == 'child', 'Weight'] = np.random.normal(x1, y1, size=children_size)
      df.loc[df['person_type'] == 'adult', 'Weight'] = np.random.normal(x2, y2, size=adult_size)
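
The two points above can be combined in one sketch. Instead of clamping, it samples from a truncated normal via `scipy.stats.truncnorm`, so every value falls inside the chosen bounds without piling up at the edges. Everything specific here is an assumption for illustration: the small stand-in `df`, the age cut-off of 20, the helper name `sample_truncated_normal`, and all means, standard deviations and bounds.

```python
import numpy as np
import pandas as pd
from scipy.stats import truncnorm


def sample_truncated_normal(mean, std, low, high, size, seed=None):
    """Draw `size` values from a normal(mean, std) restricted to [low, high]."""
    # truncnorm expects the bounds in standard-deviation units around the mean
    a = (low - mean) / std
    b = (high - mean) / std
    return truncnorm.rvs(a, b, loc=mean, scale=std, size=size, random_state=seed)


# Hypothetical stand-in for the Titanic frame; only the Age column matters here
df = pd.DataFrame({"Age": [4, 9, 25, 40, 63, 15, 33, 71]})
df["Weight"] = np.nan

# Assumed cut-off: passengers younger than 20 are treated as children
is_child = df["Age"] < 20
n_children = int(is_child.sum())
n_adults = int((~is_child).sum())

# Assumed parameters: children ~ N(30, 8) within [10, 60] kg,
# adults ~ N(70, 20) within [40, 150] kg
df.loc[is_child, "Weight"] = sample_truncated_normal(30, 8, 10, 60, n_children, seed=0)
df.loc[~is_child, "Weight"] = sample_truncated_normal(70, 20, 40, 150, n_adults, seed=1)

print(df)
```

Note that truncating changes the distribution slightly: the sample mean and standard deviation will no longer be exactly the values passed in, which may or may not matter for the exercise.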
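
For the follow-up in the question's edit: `norm.pdf` returns a density value, not a probability, and passing it a whole DataFrame (which in the Titanic data presumably still contains string columns such as the passenger name) is the likely source of the `TypeError`. A minimal sketch of one way to get the two probabilities uses the cumulative distribution function `norm.cdf`, alongside the observed proportions in the sampled column. It assumes the mean of 80 and std of 20 from the edit, and the `adults` frame below is a synthetic stand-in rather than the real data.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Synthetic stand-in for the adults frame in the question
adults = pd.DataFrame({"Weight": rng.normal(80, 20, size=718)})

mean, std = 80, 20  # parameters used to generate the weights

# Theoretical probabilities from the normal CDF
p_below_70 = norm.cdf(70, loc=mean, scale=std)
p_70_to_100 = norm.cdf(100, loc=mean, scale=std) - norm.cdf(70, loc=mean, scale=std)

# Empirical proportions observed in the sampled column
emp_below_70 = (adults["Weight"] < 70).mean()
emp_70_to_100 = adults["Weight"].between(70, 100).mean()

print(f"P(W < 70):        theory {p_below_70:.4f}, sample {emp_below_70:.4f}")
print(f"P(70 <= W <= 100): theory {p_70_to_100:.4f}, sample {emp_70_to_100:.4f}")
```

The theoretical values depend only on the mean and std; the sample proportions vary from run to run and should approach the theoretical ones as the number of rows grows.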