Search code examples
pythondataframedata-cleaningone-hot-encoding

One hot encoding from a dataframe string column with multiple values


I have a dataframe "df1" composed of 1245 lines, with a column text (object type) and topic (object type). The topic column contain different number corresponding to text label. Here is an exemple :

        text                                                topic
1207    June 2019: The French Facility for Global Envi...   3 12 7
1208    May 2019: Participants from multi-stakeholder ...   8
1209    2 July 2019: UN Member States have reached agr...   1 7
1210    30 June 2019: The G20 Leaders’ Summit and asso...   7 8 9 11 12 13 14 15 17

I would like to obtain an one hot encoding form like that (also add an 'S' before number in column name) :

        text                                                S1  S2  S3 ..... S7  S8 S9 etc.
1207    June 2019: The French Facility for Global Envi...   0    0   1  ..... 1   0  0
1208    May 2019: Participants from multi-stakeholder ...   0    0   0 ...... 0   1  0
1209    2 July 2019: UN Member States have reached agr...   1    0   0  ..... 1   0  0
1210    30 June 2019: The G20 Leaders’ Summit and asso...   0    0   0  ......1   1  1

The 'difficulty' here is that my text is multilabeled so code for simple one-hot-encoding do not works for my case. Any idea ?


Solution

  • With only using pandas, you can do something like this:

    import pandas as pd
    
    
    data = [['June 2019: The French Facility for Global Envi...', '3 12 7'],
           ['May 2019: Participants from multi-stakeholder ...','8'],
           ['2 July 2019: UN Member States have reached agr...','1 7'],
           ['30 June 2019: The G20 Leaders’ Summit and asso...','7 8 9 11 12 13 14 15 17']]
    df = pd.DataFrame(data , columns=['text', 'topic'])
    
    # creating list of strings where each value is one number out of topic column
    unique_values = ' '.join(df['topic'].values.tolist()).split(' ')
    
    # creating new column for each value in unique_values
    for number in unique_values:
        df[f'S{number}'] = 0
        
    # changing 0 to 1 for every Snumber column where topic contains number
    for idx, row in df.iterrows():
        for number in row['topic'].split(' '):
            df.loc[idx, f'S{number}'] = 1
    df.drop('topic', axis=1, inplace=True)
    

    Result:

    
        text                                                S3  S12 S7  S8  S1  S9  S11 S13 S14 S15 S17
    0   June 2019: The French Facility for Global Envi...   1   1   1   0   0   0   0   0   0   0   0
    1   May 2019: Participants from multi-stakeholder ...   0   0   0   1   0   0   0   0   0   0   0
    2   2 July 2019: UN Member States have reached agr...   0   0   1   0   1   0   0   0   0   0   0
    3   30 June 2019: The G20 Leaders’ Summit and asso...   0   1   1   1   0   1   1   1   1   1   1