
Pandas Dataframe - Adding Else?


I want to generate test data for my Bayesian network. This is my current code:

import numpy as np
import pandas as pd

data = np.random.randint(2, size=(5, 6))
columns = ['p_1', 'p_2', 'OP1', 'OP2', 'OP3', 'OP4']
df = pd.DataFrame(data=data, columns=columns)

df.loc[(df['p_1'] == 1) & (df['p_2'] == 1), 'OP1'] = 1

df.loc[(df['p_1'] == 1) & (df['p_2'] == 0), 'OP2'] = 1

df.loc[(df['p_1'] == 0) & (df['p_2'] == 1), 'OP3'] = 1

df.loc[(df['p_1'] == 0) & (df['p_2'] == 0), 'OP4'] = 1


print(df)

So whenever, for example, p_1 is 1 and p_2 is 1, OP1 should be 1 and all the other OP columns should be 0 in that row. When p_1 is 1 and p_2 is 0, OP2 should be 1 and all the others 0, and so on.

But my current output is the following:

p_1 p_2 OP1 OP2 OP3 OP4
0 0 0 0 0 1
1 0 1 1 1 1
0 0 1 1 0 1
0 1 1 1 1 1
1 0 0 1 1 0

Is there any way to fix it? What did I do wrong?

I didn't really understand the solutions to other people's questions, so I thought I'd ask here.

I hope that someone can help me.


Solution

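  • The problem is that when you instantiate df, the "OP" columns already hold random values, and each df.loc assignment only sets the matching rows to 1; it never resets the other rows to 0: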

    data = np.random.randint(2, size=(5, 6)) 
    columns = ['p_1', 'p_2', 'OP1', 'OP2', 'OP3', 'OP4'] 
    df = pd.DataFrame(data=data, columns=columns) 
    
    df                                                                      
    
       p_1  p_2  OP1  OP2  OP3  OP4
    0    1    1    0    1    0    0
    1    0    0    1    1    0    1
    2    0    1    1    1    0    0
    3    1    1    1    1    0    1
    4    0    1    1    0    1    0
    
    

    One way of fixing it with your code is to force all "OP" columns to 0 before applying the conditions:

    df["OP1"] = df["OP2"] = df["OP3"] = df["OP4"] = 0
    

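    But then you are generating random numbers for the "OP" columns only to overwrite them. I'd generate only p_1 and p_2 and compute each "OP" column from them instead, e.g. for OP1: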

    data = np.random.randint(2, size=(5, 2)) 
    columns = ['p_1', 'p_2'] 
    df = pd.DataFrame(data=data, columns=columns) 
    df["OP1"] = ((df['p_1'] == 1) & (df['p_2'] == 1)).astype(int)
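
For reference, here is a minimal sketch of the same idea applied to all four columns, using the mapping described in the question (OP1 for 1/1, OP2 for 1/0, OP3 for 0/1, OP4 for 0/0). The seeded generator is my own assumption, added only so the output is reproducible; np.random.randint would work the same way.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator, used only to make this sketch reproducible.
rng = np.random.default_rng(0)

# Only the two parent columns are random.
df = pd.DataFrame(rng.integers(2, size=(5, 2)), columns=["p_1", "p_2"])

# Each OP column is 1 for exactly one (p_1, p_2) combination and 0 otherwise,
# because the boolean mask is converted to int for every row.
df["OP1"] = ((df["p_1"] == 1) & (df["p_2"] == 1)).astype(int)
df["OP2"] = ((df["p_1"] == 1) & (df["p_2"] == 0)).astype(int)
df["OP3"] = ((df["p_1"] == 0) & (df["p_2"] == 1)).astype(int)
df["OP4"] = ((df["p_1"] == 0) & (df["p_2"] == 0)).astype(int)

print(df)
```

Since every row matches exactly one of the four conditions, each row should end up with a single 1 among the "OP" columns, and no stale values are left over from initialization.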