There are two .csv files containing Tweets and a classification for each Tweet: pos, neg and neutral. The class column holds the classification and the text column holds the Tweet.
This is my code:
def prediction():
    print("Reading files...")
    # Will learn from this data set.
    train = file2SentencesArray('twitter-sanders-apple3')
    # Test dataset.
    test = file2SentencesArray('twitter-sanders-apple2')
    print("Complete!")

    print("Cleaning sentences...")
    # cleanSentences will remove html, stop words and some characters.
    cleanTrainSentences = cleanSentences(train["text"])
    cleanTestSentences = cleanSentences(test["text"])
    print("Complete!...")

    print("Fitting sentences...")
    vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)
    trainDataFeatures = vectorizer.fit_transform(cleanTrainSentences)
    np.asarray(trainDataFeatures)
    testDataFeatures = vectorizer.transform(cleanTestSentences)
    np.asarray(testDataFeatures)

    # Getting error here.
    randomized_lasso = RandomizedLasso()
    randomized_lasso.fit_transform(trainDataFeatures, testDataFeatures)
    trainDataFeatures = randomized_lasso.transform(trainDataFeatures)

    # and here.
    # pca = decomposition.PCA(n_components=2)
    # pca.fit_transform(trainDataFeatures)
    # trainDataFeatures = pca.transform(trainDataFeatures)
    print("Complete!")

    print("Predicting...")
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainDataFeatures, train["class"])
    result = forest.predict(testDataFeatures)
    print("Complete...")
    return result
The Randomized lasso and the PCA are both throwing exceptions:
PCA – PCA does not support sparse input.
Randomized lasso – bad input shape
My trainDataFeatures looks like this:
(0, 573) 1
(0, 1411) 2
(0, 2748) 1
(0, 1073) 1
(1, 126) 1
(2, 1203) 1
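The input format for both PCA and Randomized Lasso is not correct. CountVectorizer returns sparse matrices, and the np.asarray calls in your code discard their results, so trainDataFeatures and testDataFeatures are still sparse when they reach the next step. Please replace the following two lines and try again.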
np.asarray(trainDataFeatures)
np.asarray(testDataFeatures)
# replace the above two lines with these
trainDataFeatures = trainDataFeatures.toarray()
testDataFeatures = testDataFeatures.toarray()
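
To make the fix concrete, here is a minimal sketch of the dense conversion followed by PCA and the random forest. The sentences, labels and snake_case variable names are made up for illustration and are not taken from the question's data. It also applies the fitted PCA to the test features, on the assumption that the classifier should see train and test data in the same reduced space.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the cleaned Tweets and their classes (hypothetical data).
train_sentences = [
    "apple makes great phones",
    "battery life is terrible",
    "new update released today",
    "great battery great phone",
]
train_labels = ["pos", "neg", "neutral", "pos"]
test_sentences = ["terrible update", "great phones"]

vectorizer = CountVectorizer(analyzer="word", max_features=5000)
train_sparse = vectorizer.fit_transform(train_sentences)  # SciPy sparse matrix
test_sparse = vectorizer.transform(test_sentences)

# toarray() returns a dense NumPy array; the result must be assigned.
train_dense = train_sparse.toarray()
test_dense = test_sparse.toarray()

# Fit PCA on the training data only, then reuse it for the test data.
pca = PCA(n_components=2)
train_reduced = pca.fit_transform(train_dense)
test_reduced = pca.transform(test_dense)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(train_reduced, train_labels)
result = forest.predict(test_reduced)

print(train_reduced.shape, test_reduced.shape)
```

Two hedged notes. First, the "bad input shape" error from Randomized Lasso probably has a second cause: the second argument of fit_transform is the target, so it would expect train["class"] rather than testDataFeatures. RandomizedLasso has also been removed from newer scikit-learn releases, which is why the sketch leaves it out. Second, toarray() can use a lot of memory on a large vocabulary; TruncatedSVD accepts sparse input directly and may be worth trying in that case.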