I am creating a simple recommender that recommends other users based on the similarity of their tweets. I used TF-IDF to vectorize all the text and was able to fit the data to a MultinomialNB, but I keep getting errors when trying to predict.
I've tried reshaping the data into an array, but I get an error: "could not convert string to float". Can I even use this algorithm for this data? I tried different columns to see if I could get a result, but I got the same error.
ValueError Traceback (most recent call last)
<ipython-input-39-a982bc4e1f49> in <module>
20 nb_mul.fit(train_idf,y_train)
21 user_knn = UserUser(10, min_sim = 0.4, aggregate='weighted-average')
---> 22 nb_mul.predict(y_test)
23 #nb_mul.predict(np.array(test['Tweets'], test['Sentiment']))
24 #TODO: find a way to predict with test data
~/anaconda2/lib/python3.6/site-packages/sklearn/naive_bayes.py in predict(self, X)
64 Predicted target values for X
65 """
---> 66 jll = self._joint_log_likelihood(X)
67 return self.classes_[np.argmax(jll, axis=1)]
68
~/anaconda2/lib/python3.6/site-packages/sklearn/naive_bayes.py in _joint_log_likelihood(self, X)
728 check_is_fitted(self, "classes_")
729
--> 730 X = check_array(X, accept_sparse='csr')
731 return (safe_sparse_dot(X, self.feature_log_prob_.T) +
732 self.class_log_prior_)
~/anaconda2/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
525 try:
526 warnings.simplefilter('error', ComplexWarning)
--> 527 array = np.asarray(array, dtype=dtype, order=order)
528 except ComplexWarning:
529 raise ValueError("Complex data not supported\n"
~/anaconda2/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: could not convert string to float: '["b\'RT @Avalanche: Only two cities have two teams in the second round of the playoffs...\\\\n\\\\nDenver and Boston!\\\\n\\\\n#MileHighBasketball #GoAvsGo http\\\\xe2\\\\x80\\\\xa6\'"]'
for train, test in xf.partition_users(final_test[['user','Tweets','Sentiment']], 5, xf.SampleFrac(0.2)):
    x_train = []
    for index, row in train.iterrows():
        x_train.append(row['Tweets'])
    y_train = np.array(train['Sentiment'])
    y_test = np.array([test['user'], test['Tweets']])
    #print(y_train)
    tfidf = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True, stop_words='english', lowercase=False)
    train_idf = tfidf.fit(x_train)
    train_idf = train_idf.transform(x_train)
    nb_mul = MultinomialNB()
    nb_mul.fit(train_idf, y_train)
    user_knn = UserUser(10, min_sim=0.4, aggregate='weighted-average')
    nb_mul.predict(y_test)
The data looks like this:
         user                                             Tweets Sentiment  Score
0  2287418996  ["b'RT @HPbasketball: This stuff is 100% how K...       neu  0.815
1  2287418996  ["b'@KeuchelDBeard I may need to rewatch Begin...       neu  0.744
2  2287418996  ["b'@keithlaw Is that the stated reason for th...       neu  1.000
3  2287418996  ['b"@keithlaw @Yanks23242 I definitely don\'t ...       neu  0.863
4  2287418996  ["b'@Yanks23242 @keithlaw Sorry, please sub Jo...       neu  0.825
Again, I expect to feed in users with their tweets and sentiment and have the model recommend another user in the data based on similarity.
You should not feed the raw tweets directly to the classifier. You need to use the fitted TfidfVectorizer to transform the test text into vectors first.
Make the following change:
nb_mul.predict(tfidf.transform(test['Tweets']))
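As a fuller sketch (assuming the same x_train, y_train, and test objects from your loop above), the fit/predict flow would look like this:
# Fit the vectorizer on the training tweets and build the training matrix in one step.
tfidf = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True,
                        use_idf=True, stop_words='english', lowercase=False)
train_idf = tfidf.fit_transform(x_train)

nb_mul = MultinomialNB()
nb_mul.fit(train_idf, y_train)

# Transform the test tweets with the *same* fitted vectorizer before predicting,
# so the test matrix uses the vocabulary the model was trained on.
test_idf = tfidf.transform(test['Tweets'])
predicted_sentiment = nb_mul.predict(test_idf)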
Understand that this model will only predict the sentiment of the test-data tweets.
If your intention is recommendation, consider using a dedicated recommendation approach rather than a classifier.
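If you do want to stick with tweet similarity, one possible direction (a rough sketch only, assuming your final_test DataFrame with 'user' and 'Tweets' columns as shown above, and that each Tweets entry is a string) is to aggregate each user's tweets into one document, vectorize with TF-IDF, and recommend the most similar users by cosine similarity:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Concatenate all tweets per user so each user becomes a single "document".
user_docs = final_test.groupby('user')['Tweets'].apply(' '.join)

# Vectorize the per-user documents.
vec = TfidfVectorizer(stop_words='english')
user_matrix = vec.fit_transform(user_docs)

# Pairwise cosine similarity between all users.
sims = cosine_similarity(user_matrix)

# Recommend the users most similar to a given user (skip the user itself).
user_index = 0  # position of the user of interest in user_docs.index
top = np.argsort(sims[user_index])[::-1][1:6]
recommended_users = user_docs.index[top]
The user_index, the ' '.join aggregation, and the top-5 cutoff here are illustrative choices, not part of your original code.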