
Do I have to loop? Is there a faster way to build dummy variables?


I have some plant data that looks like this (but with up to 7 attribute columns):

     Unnamed: 0     plant          att_1           att_2 ...
0            0     plant_a         sunlover        tall
1            1     plant_b         waterlover      sunlover
2            2     plant_c         fast growing    sunlover
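
For reference, a minimal frame matching the sample above (the Unnamed: 0 column is just the index, so I leave it out; imports included for the snippets below):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'plant': ['plant_a', 'plant_b', 'plant_c'],
    'att_1': ['sunlover', 'waterlover', 'fast growing'],
    'att_2': ['tall', 'sunlover', 'sunlover'],
})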

I tried to use pandas get_dummies like this:

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],'C': [1, 2, 3]})

pd.get_dummies(df, prefix=['col1', 'col2'])

which gives:

 C  col1_a  col1_b  col2_a  col2_b  col2_c
 0  1       1       0       0       1       0
 1  2       0       1       1       0       0
 2  3       1       0       0       0       1

But sunlover should be encoded as 1 regardless of whether it appears in att_1 or att_2. That way I would end up with roughly 30 dummy variables instead of 7 * 30 = 210. I have tried to loop through the whole set and add the value for each dummy:

for count, (idx, row) in enumerate(data_plants.iterrows()):
    print("First", count, idx)
    # row is a Series; iterating over it yields the attribute values
    for attribute in row:
        print("Second", count, attribute)

The code just prints because I already saw that this approach wastes time. It works, but it is not fast enough for 100k rows and more. I thought of using .value_counts() to get the attributes and then setting the matching dummy column in the dataframe to 1, but then I would overwrite the attribute. At the moment I am a little bit lost and out of ideas. Perhaps I have to use another package?

The goal would be something like:

     Unnamed: 0     plant          att_1           att_2       sunlover      waterlover     tall  ...
0            0     plant_a         sunlover        tall        1             0              1
1            1     plant_b         waterlover      sunlover    1             1              0
2            2     plant_c         fast growing    sunlover    1             0              0

Solution

  • Use get_dummies with max:

    c = ['att_1', 'att_2']
    df1 = df.join(pd.get_dummies(df[c], prefix='', prefix_sep='').max(axis=1, level=0))
    print(df1)
         plant         att_1     att_2  fast growing  sunlover  waterlover  tall
    0  plant_a      sunlover      tall             0         1           0     1
    1  plant_b    waterlover  sunlover             0         1           1     0
    2  plant_c  fast growing  sunlover             1         1           0     0
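
    Note: newer pandas (2.0+) removed the level argument of DataFrame.max, so the line above fails there. A version-independent sketch that collapses the duplicated column names by grouping on them instead:

    c = ['att_1', 'att_2']
    dummies = pd.get_dummies(df[c], prefix='', prefix_sep='')
    # 'sunlover' appears once per attribute column; transpose, group on the
    # duplicated labels and take the max to merge them into one column
    df1 = df.join(dummies.T.groupby(level=0).max().T.astype(int))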
    

    Performance for 3k rows (real data may behave differently):

    df = pd.concat([df] * 1000, ignore_index=True)
    
    
    In [339]: %%timeit
         ...: 
         ...: c = ['att_1', 'att_2']
         ...: df1 = df.join(pd.get_dummies(df[c], prefix='', prefix_sep='').max(axis=1, level=0))
         ...: 
         ...: 
    10.7 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [340]: %%timeit
         ...: attCols = df[['att_1', 'att_2']]
         ...: colVals = pd.Index(np.sort(attCols.stack().unique()))
         ...: def myDummies(row):
         ...:     return pd.Series(colVals.isin(row).astype(int), index=colVals)
         ...: 
         ...: df1 = df.join(attCols.apply(myDummies, axis=1))
         ...: 
         ...: 
    1.03 s ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Another solution; like the first, it avoids the per-row Python function call that makes the apply version roughly 100x slower:

    In [133]: %%timeit
         ...: c = ['att_1', 'att_2']
         ...: df1 = (df.join(pd.DataFrame([dict.fromkeys(x, 1) for x in df[c].to_numpy()])
         ...:                  .fillna(0)
         ...:                  .astype(np.int8)))
         ...:                  
    13.1 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
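
    As a quick sanity check (a sketch; df_max and df_dict are hypothetical names for the results of the get_dummies and dict.fromkeys variants), the two fast approaches should agree once the dummy columns are aligned:

    dummy_cols = sorted(set(df_max.columns) - set(df.columns))
    pd.testing.assert_frame_equal(
        df_max[dummy_cols].astype(np.int8),
        df_dict[dummy_cols].astype(np.int8),
    )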