python  pandas  dataframe  scikit-learn  countvectorizer

How do I merge data with CountVectorizer features?


Here's my dataset

        body                                            customer_id   name
14828   Thank you to apply to us.                       5458          Sender A
23117   Congratulation your application is accepted.    5136          Sender B
23125   Your OTP will expire in 10 minutes.             5136          Sender A

Here's my code

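import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# a is the DataFrame shown above
b = a['body']
vect = CountVectorizer()
vect.fit(b)
X_vect = vect.transform(b)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+)
pd.DataFrame(X_vect.toarray(), columns=vect.get_feature_names_out())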

The output is

    10  application  apply  ...  your
0   0   0            1      ...  0
1   0   1            0      ...  1
2   1   0            0      ...  1

What I need is

        body                                            customer_id   name        10  application  apply  ...  your
14828   Thank you to apply to us.                       5458          Sender A    0   0            1      ...  0
23117   Congratulation your application is accepted.    5136          Sender B    0   1            0      ...  1
23125   Your OTP will expire in 10 minutes.             5136          Sender A    1   0            0      ...  1

How am I supposed to do this? I'd still like to use CountVectorizer so that I can modify the function in the future.


Solution

  • Pass the original index to the DataFrame constructor so the count rows carry the same labels as a, then join the result to the original df (join does a left join on the index by default):

    b = pd.DataFrame(X_vect.toarray(), columns=vect.get_feature_names_out(), index=a.index)
    a = a.join(b)
    

    Or use merge, which needs more parameters: it has to be told to match on the index of both frames, and its default is an inner join:

    a = a.merge(b, left_index=True, right_index=True, how='left')