Search code examples
pandasdataframemanipulate

From 15 object variables to final target variable (0 or 1)


Can i go from 15 object variables to one final binary target variable?

Those 15 variables has ~10.000 different codes, my dataset is about 21.000.000 records. What im trying to do is at first replace the codes i want with 1 and the other with 0, then if one of fifteen variables is 1 the target variable will be 1, if all fifteen variables are 0 the target variable will be 0.

i have tried to work with to_replace, as_type, to_numeric, infer_objects with not good results,for example my dataset look like this head(5):

    D       P1    P2   P3    P4   P5   P6   P7   P8   P9   P10   P11   P12   P13   P14  P15
    41234   1234  4367 874   NAN  NAN  NAN  789  NAN  NAN  NAN   NAN   NAN   NAN   NAN  NAN 
    42345   7657  4367 874   NAN  NAN  NAN  789  NAN  NAN  NAN   NAN   NAN   NAN   NAN  NAN
    34212   7654  4347 474   NAN  NAN  NAN  789  NAN  NAN  NAN   NAN   NAN   NAN   NAN  NAN
    34212   8902  4317 374   NAN  452  NAN  719  NAN  NAN  NAN   NAN   NAN   NAN   NAN  NAN 
    19374   2564  4387 274   NAN  452  NAN  799  NAN  NAN  NAN   NAN   NAN   NAN   NAN  NAN

I want to transform all nan as 0, and selected codes with 1, so all the P1-P15 will be binary and the i will create a final P variable with them.

For example if P1-P15 have '3578','9732','4734'...(im using about 200 codes) i want to become 1. All the other values i want to become 0. D variable should stay as it is. The final dataset will be (D,P), then i will add the train variables

Any ideas? The following code gives me wrong results.

selCodes=['3722','66']
dfnew['P']=(dfnew.loc[:,'PR1':].astype(str).isin(selCodes).any(axis=1).astype(int))

Take a look at a test dataset(left), and new P(right).With the example code 3722 P should be 1.

Test Dataset New P-3722 is in selCodes, so P should be 1


Solution

  • IIUC, Use, DataFrame.isin:

    # example select codes
    selCodes = ['1234', '9732', '719']
    
    df['P'] = (
        df.loc[:, 'P1':].astype(str)
        .isin(selCodes).any(axis=1).astype(int)
    )
    
    df = df[['D', 'P']]
    

    Result:

           D  P
    0  41234  1
    1  42345  0
    2  34212  0
    3  34212  1
    4  19374  0