Creating a dummy variable based on a string pattern in python


I have the following dataset. None is meant to be a Python missing value. The column's dtype is object (from df.dtypes).

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['triparty'])
df["triparty"] = ["AB65", "None", "GDW322", "DASED", "None"]

I want to create a dummy that takes the value 1 when triparty is None and 0 otherwise. I tried out several variations of

df["triparty"]=[0 if df["triparty"] == np.NaN else 1 for x in df["triparty"]]

df["triparty"]=[0 if df["triparty"] == "None" else 1 for x in df["triparty"]]

but neither works. Both raise ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I solve the problem?


Solution

  • You can do it with np.where, which evaluates the condition element-wise over the whole column:

    df["dummy"] = np.where(df["triparty"] == "None", 0, 1)
    print(df)
    

    Or build a boolean column and cast it to int:

    df["dummy"] = (df["triparty"] != "None").astype(int)
    # or
    df["dummy"] = (~(df["triparty"] == "None")).astype(int)
    

    Output

      triparty  dummy
    0     AB65      1
    1     None      0
    2   GDW322      1
    3    DASED      1
    4     None      0
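
The ValueError in the question comes from comparing the whole Series (df["triparty"]) inside the list comprehension instead of the loop variable x. A minimal sketch of the corrected loop follows, together with a variant for the case where the column holds real missing values (None or np.nan) rather than the string "None". The frame df_missing and the column names dummy_loop and is_missing are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"triparty": ["AB65", "None", "GDW322", "DASED", "None"]})

# Compare the loop variable x (a single string), not the whole Series.
# Comparing the Series yields a boolean Series, which `if` cannot evaluate.
df["dummy_loop"] = [0 if x == "None" else 1 for x in df["triparty"]]

# Assumption: the column contains real missing values instead of the
# string "None". Then notna()/isna() are the appropriate checks, because
# an equality test against np.nan is never True.
df_missing = pd.DataFrame({"triparty": ["AB65", None, "GDW322", "DASED", np.nan]})
df_missing["dummy"] = df_missing["triparty"].notna().astype(int)

# The question text asks for 1 on missing rows; isna() gives that encoding.
df_missing["is_missing"] = df_missing["triparty"].isna().astype(int)

print(df)
print(df_missing)
```

The vectorized forms above are usually preferable to the list comprehension on large frames. The comprehension is shown only to illustrate what went wrong in the original attempt.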