Tags: python, pandas, random, conditional-statements, weighted

Create random values in dataframe with different weights by condition


I have been trying to create a simulated dataframe with sex and education as features, generating the data according to proportions I already know.

Something like this:

import random
import pandas as pd

weight_sex = [0.55, 0.45]
options_sex = [0, 1] # 0 = Men, 1 = Women

# relative weights (percentages); random.choices does not require them to sum to 1
weight_educ = [0.6, 26.8, 23.6, 23.6, 24.1, 1.3]

options_educ = [0, 1, 2, 3, 4, 5] # 0 = None, 5 = Bachelor or more

sex = pd.Series(random.choices(options_sex, weights = weight_sex, k = 100), name = 'sex')
education = pd.Series(random.choices(options_educ, weights = weight_educ, k = 100), name = 'education')
people = pd.concat([sex, education], axis = 1)

Now I want to create a new column that says whether the person is unemployed, has informal work, or has formal work. I know these proportions differ depending on the features of the population. To keep it simple, let's say males with education higher than 3 have a better occupation rate than the rest of the population.

Something like this:

Option_work = [0, 1, 2] # 0 = Unemployed, 1 = informal work, 2 = formal work
weight_work_educated_man = [0.2, 0.3, 0.5]
weight_work_other_people = [0.3, 0.4, 0.3]

So if I say

(people['sex'] == 0) & (people['education'] > 3) generate the value with the weight_work_educated_man

And for everyone else, for example if I say

people['education'] <= 3 generate the value with the weight_work_other_people

How can I create a new column that randomizes the data with the weights I have, using the features of each row as the condition? I've been trying to find a way with random.choices or the sample function from pandas but got stuck. It is important that the values are randomized, so the results are not exactly the same the next time I run the code.


Solution

  • Expanding on your code you can do this:

    options_work = [0, 1, 2] # 0 = Unemployed, 1 = informal work, 2 = formal work
    # names taken from the question, so the education weights (weight_educ) are not overwritten
    weight_work_educated_man = [0.2, 0.3, 0.5]
    weight_work_other_people = [0.3, 0.4, 0.3]
    
    # draw one candidate value per row for both choices «educated» and «other»
    n = len(people)
    work_educ = pd.Series(random.choices(options_work, weights=weight_work_educated_man, k=n), index=people.index)
    work_other = pd.Series(random.choices(options_work, weights=weight_work_other_people, k=n), index=people.index)
    
    # create new column «work» and fill with data from one choice
    people["work"] = work_other
    
    # filter dataframe on your condition and replace values in column with values from other choice
    mask = (people.sex == 0) & (people.education > 3)
    people.loc[mask, "work"] = work_educ[mask]
    people.head()
    

    Example output (the values are random, so yours will differ):

        sex     education   work
    0   1       1           1
    1   1       4           1
    2   1       4           0
    3   0       1           2
    4   0       4           0
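
The same idea can also be written with NumPy instead of the random module. What follows is only a minimal sketch, not the only way to do it: the names rng, mask, educated and other are made up for illustration, and the weights are the ones from the question. It draws one candidate value per row from each distribution and lets np.where pick the right one depending on the row's features.

```python
import numpy as np
import pandas as pd

# Unseeded generator: the draws differ on every run.
rng = np.random.default_rng()

n = 100

# Education weights from the question are percentages; Generator.choice
# needs probabilities that sum to 1, so they are normalized here.
weight_educ = np.array([0.6, 26.8, 23.6, 23.6, 24.1, 1.3])
weight_educ = weight_educ / weight_educ.sum()

people = pd.DataFrame({
    "sex": rng.choice([0, 1], size=n, p=[0.55, 0.45]),          # 0 = Men, 1 = Women
    "education": rng.choice([0, 1, 2, 3, 4, 5], size=n, p=weight_educ),
})

options_work = [0, 1, 2]  # 0 = Unemployed, 1 = informal work, 2 = formal work
weight_work_educated_man = [0.2, 0.3, 0.5]
weight_work_other_people = [0.3, 0.4, 0.3]

# Rows that should use the «educated man» weights.
mask = (people["sex"] == 0) & (people["education"] > 3)

# One candidate draw per row from each distribution.
educated = rng.choice(options_work, size=n, p=weight_work_educated_man)
other = rng.choice(options_work, size=n, p=weight_work_other_people)

# Keep the «educated» draw where the condition holds, the «other» draw elsewhere.
people["work"] = np.where(mask, educated, other)
```

If you ever need repeatable results while debugging, np.random.default_rng accepts a seed (for example np.random.default_rng(42)); leaving it out keeps the output different on each run, as the question asks.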