Tags: pandas, scikit-learn, nltk, tokenize

How do I do word tokenization in a pandas DataFrame?


This is my data:

No  Text                    
1   You are smart
2   You are beautiful

My expected output:

No  Text                   You    are  smart  beautiful                 
1   You are smart            1      1      1          0
2   You are beautiful        1      1      0          1

Solution

  • For an NLTK solution, use word_tokenize to turn each row into a list of words, encode the lists with MultiLabelBinarizer, and finally join the result back to the original DataFrame:

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer
    from nltk import word_tokenize
    
    # word_tokenize requires the 'punkt' tokenizer models:
    # nltk.download('punkt')
    mlb = MultiLabelBinarizer()
    s = df['Text'].apply(word_tokenize)
    df = df.join(pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index))
    print(df)
       No               Text  You  are  beautiful  smart
    0   1      You are smart    1    1          0      1
    1   2  You are beautiful    1    1          1      0
    
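    If downloading NLTK data is an obstacle, scikit-learn's CountVectorizer can tokenize and count in one step. This is an alternative sketch, not the answer's original method: the sample DataFrame is reconstructed from the question for illustration, lowercase=False is assumed so that 'You' keeps its capital, and get_feature_names_out needs scikit-learn ≥ 1.0 (older versions use get_feature_names).

    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # Sample data reconstructed from the question
    df = pd.DataFrame({'No': [1, 2],
                       'Text': ['You are smart', 'You are beautiful']})

    # lowercase=False keeps tokens as-is (the default lowercases them)
    vec = CountVectorizer(lowercase=False)
    counts = vec.fit_transform(df['Text'])
    df = df.join(pd.DataFrame(counts.toarray(),
                              columns=vec.get_feature_names_out(),
                              index=df.index))
    print(df)
    ```

    Note that unlike MultiLabelBinarizer, CountVectorizer produces token counts rather than 0/1 indicators, so a word repeated within one row would show a value greater than 1.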

    For a pure pandas solution, use Series.str.get_dummies (which splits each string on the separator and builds one indicator column per unique token) with join:

    df = df.join(df['Text'].str.get_dummies(sep=' '))
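
    A minimal self-contained sketch of the get_dummies approach, with the question's sample data reconstructed here so the example runs on its own:

    ```python
    import pandas as pd

    # Sample data reconstructed from the question
    df = pd.DataFrame({'No': [1, 2],
                       'Text': ['You are smart', 'You are beautiful']})

    # str.get_dummies splits each string on the separator and creates
    # one 0/1 indicator column per unique token
    out = df.join(df['Text'].str.get_dummies(sep=' '))
    print(out)
    ```

    One caveat: get_dummies yields 0/1 indicators, so a word appearing twice in the same row still produces 1, not 2.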