Issue while inserting CountVectorizer results into a dataframe


I have one dataframe of shape (4237, 19) and another of shape (4237, 6). I need to combine them column-wise, so the resulting dataframe should have shape (4237, 25), but I am getting (5524, 25) instead. I can't work out what is causing this.

This is the code I used:

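import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# x_train and x_test are existing dataframes (their creation is not shown)
social_media_vectorizer = CountVectorizer(lowercase=True)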

train_social_media_vector = social_media_vectorizer.fit_transform(x_train["social_media"].values.astype("U"))
test_social_media_vector = social_media_vectorizer.transform(x_test["social_media"].values.astype('U'))

print(x_train.shape)
print(x_test.shape)

train_social_media_df = pd.DataFrame(train_social_media_vector.todense(), columns=social_media_vectorizer.get_feature_names_out())
test_social_media_df = pd.DataFrame(test_social_media_vector.todense(), columns=social_media_vectorizer.get_feature_names_out())
x_train = pd.concat([x_train, train_social_media_df], axis=1)
x_test = pd.concat([x_test, test_social_media_df], axis=1)

print("="*100)
print(x_train.shape)
print(x_test.shape)

print("="*100)
print(social_media_vectorizer.vocabulary_)

Result

(4237, 19)
(1816, 19)
====================================================================================================
(5524, 25)
(3058, 25)
====================================================================================================
{'facebook': 0, 'linkedin': 2, 'twitter': 4, 'instagram': 1, 'youtube': 5, 'producthunt': 3}

Solution

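  • The shape of train_social_media_vector.todense() is (4237, 6), as expected. The 1287 extra rows (5524 - 4237) come from index alignment: pd.concat with axis=1 joins the frames on their row index. x_train presumably still carries the row labels of the original dataframe (for example after a shuffled train/test split), while train_social_media_df gets a fresh 0..4236 index, so every label that exists in only one of the two frames becomes a separate row padded with NaN.

    ignore_index=True does not help here: with axis=1 it only renumbers the column labels. Instead, make the row indexes match before concatenating, for example by resetting the index of the original frames:

    x_train = pd.concat([x_train.reset_index(drop=True), train_social_media_df], axis=1)
    x_test = pd.concat([x_test.reset_index(drop=True), test_social_media_df], axis=1)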
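
    The row growth can be reproduced on a tiny example. This is only a sketch with made-up data: the index labels [7, 2, 5] stand in for a shuffled train split, and the hand-written counts matrix stands in for the CountVectorizer output, so neither comes from the question itself. It shows that pd.concat with axis=1 aligns on row labels, and that building the new frame with index=x_train.index is one way to keep the original labels instead of resetting them.

```python
import pandas as pd

# Hypothetical toy data; the index labels mimic a shuffled train split.
x_train = pd.DataFrame(
    {"social_media": ["facebook twitter", "linkedin", "youtube facebook"]},
    index=[7, 2, 5],
)

# Stand-in for the dense CountVectorizer output (assumed values).
names = ["facebook", "linkedin", "twitter", "youtube"]
counts = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1]]

# Default RangeIndex (0, 1, 2): only label 2 also exists in x_train.
misaligned = pd.DataFrame(counts, columns=names)
bad = pd.concat([x_train, misaligned], axis=1)
print(bad.shape)  # (5, 5): rows are the union of {7, 2, 5} and {0, 1, 2}

# With axis=1, ignore_index=True only renumbers the columns.
still_bad = pd.concat([x_train, misaligned], axis=1, ignore_index=True)
print(still_bad.shape)  # (5, 5)

# Reusing x_train's index keeps the rows aligned one to one.
aligned = pd.DataFrame(counts, columns=names, index=x_train.index)
good = pd.concat([x_train, aligned], axis=1)
print(good.shape)  # (3, 5)
```

    Passing index=x_train.index preserves the original row labels, which can matter if the rows need to be matched back to y_train or to the source data later; reset_index(drop=True) is simpler when those labels are no longer needed.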