python optimization pandas dummy-variable

Pandas Optimized Way to Create Dummy-Variable?


I am creating a new dummy variable based on a given column and a criterion. Below is the code I am working with. It works, but it is too slow for what I would like to do. Is there a faster, perhaps vectorized, way to create dummies in pandas, specifically for my example?

I have looked at the get_dummies function in pandas, but it seems to do something a little different from what I am doing here. I could be wrong, though, so if anyone has a way to make get_dummies work with this example, that would be an acceptable answer too.

def flagger(row, criteria, col):
    if row[col] <= criteria:
        return 1
    if row[col] > criteria:
        return 0

dstk['dropflag'] = dstk.apply(lambda row: flagger(row, criteria, col), axis=1)

Edit: There are two good answers here. At a glance they both look equally fast (at least to the same order of magnitude) so I just accepted one. If anyone wants to do some more serious profiling I would be happy to revise my answer choice.


Solution

  • Why not try np.where? It is a column-wise vectorized operation, and it is much faster than a row-wise apply.

    import numpy as np
    dstk['dropflag'] = np.where(dstk[col] <= criteria, 1, 0)  # col holds the column name, as in the question
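
For anyone who wants to try this out, here is a minimal, self-contained sketch of the vectorized approach. The sample data and the column name "price" are made up for illustration; dstk, col, and criteria mirror the names in the question. It also shows an equivalent alternative, casting the boolean comparison to int, which is my own addition and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "price" is an assumed column name.
dstk = pd.DataFrame({"price": [1.0, 5.0, 9.0, 3.0]})
col = "price"
criteria = 4

# Vectorized flag: 1 where the value is <= criteria, else 0.
flag_where = np.where(dstk[col] <= criteria, 1, 0)

# Equivalent alternative: cast the boolean comparison to int.
flag_astype = (dstk[col] <= criteria).astype(int)

dstk["dropflag"] = flag_where
print(dstk)
```

One difference from the original flagger function is worth keeping in mind: a comparison against NaN evaluates to False, so rows with a missing value get 0 here, whereas flagger returns None for them. As for get_dummies, it one-hot encodes categorical values into one column per category, so it addresses a different task from a single threshold flag.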