python pandas data-structures data-science dummy-variable

Creating Dummy Variables from String Column


I have a pandas dataframe (N = 1485) that looks like this:

ID          Intervention
1           Blood Draw, Flushed, Locked
1           Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed
1           Blood Draw, Flushed
2           Blood return Verified, Flushed
2           Cap Changed
3           Port De-Accessed

I want to dummy code each of the comma-separated strings so that the result looks similar to this:

ID          Blood Draw          Flushed          Locked      ....
1              Yes                Yes             Yes
1              Yes                No              No
...

Thanks!


Solution

  • You can use pd.Series.str.get_dummies and a dictionary mapping:

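    d = {1: 'yes', 0: 'no'}
    # pop() removes 'Intervention' from df; get_dummies(', ') makes one 0/1 column per label.
    # On pandas >= 2.1, applymap is deprecated in favor of DataFrame.map.
    res = df.join(df.pop('Intervention').str.get_dummies(', ').applymap(d.get))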
    

    In my opinion, it's best to convert to strings for display purposes only. Boolean values are stored and manipulated more efficiently in a Boolean series.

    Result

    print(res)
    
       ID Blood Draw Blood return Verified Cap Changed Flushed Heparin-Locked  \
    0   1        yes                    no          no     yes             no   
    1   1        yes                    no          no      no            yes   
    2   1        yes                    no          no     yes             no   
    3   2         no                   yes          no     yes             no   
    4   2         no                    no         yes      no             no   
    5   3         no                    no          no      no             no   
    
      Locked Port De-Accessed Tubing Changed  
    0    yes               no             no  
    1     no              yes            yes  
    2     no               no             no  
    3     no               no             no  
    4     no               no             no  
    5     no              yes             no  
    

    Setup

    import pandas as pd

    df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                       'Intervention': ['Blood Draw, Flushed, Locked',
                                        'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
                                        'Blood Draw, Flushed', 'Blood return Verified, Flushed',
                                        'Cap Changed', 'Port De-Accessed']})
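
The advice above about keeping Boolean values and converting to strings only for display can be sketched as follows. This is a minimal, self-contained variant under a few assumptions, not the answer's exact code: the names `flags`, `res_bool`, `res_display`, and `labels` are illustrative, the capitalised 'Yes'/'No' labels follow the question's desired output, and a per-column `Series.map` stands in for `applymap` so the snippet does not depend on the pandas version.

```python
import pandas as pd

# Same sample data as in the Setup above.
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3],
    'Intervention': [
        'Blood Draw, Flushed, Locked',
        'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
        'Blood Draw, Flushed',
        'Blood return Verified, Flushed',
        'Cap Changed',
        'Port De-Accessed',
    ],
})

# Split on ', ' and build one 0/1 indicator column per distinct label,
# then store the indicators as booleans for analysis.
flags = df['Intervention'].str.get_dummies(sep=', ').astype(bool)

# Join onto the ID column without mutating df (unlike df.pop).
res_bool = df[['ID']].join(flags)

# Convert to 'Yes'/'No' strings only for display, one column at a time.
labels = {True: 'Yes', False: 'No'}
res_display = res_bool.copy()
for col in flags.columns:
    res_display[col] = res_bool[col].map(labels)

print(res_display)

# Booleans keep follow-up questions cheap, e.g. how often each
# intervention occurs across all rows.
print(flags.sum())
```

Keeping `res_bool` as the working frame means filters such as `res_bool[res_bool['Flushed']]` work directly, while `res_display` is only a presentation copy.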