I'm writing an algorithm to classify the tweets in my dataset as positive or negative, and I want to test its accuracy. To find the best possible solution, I want a baseline built with classical ML algorithms. After preprocessing the tweets, and inspired by related work, I first tried the Bag-of-Words model; the code runs successfully and I can calculate the accuracy and the F-score. After the text preprocessing, I split the dataset into a train set and a test set:
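# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split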
X_train, X_test1, y_train, y_test1 = train_test_split(X, y, test_size = 0.11, random_state = 0)
I want to remove all the tweets labeled as negative from the test set (keeping only the positive ones) and calculate the precision, recall, and F-score of the algorithm, and afterwards do the same thing for the tweets labeled as positive. I tried doing it like this:
finrow = len(X_test1)
finCol = len(X_test1[0])
for o in range(0, finrow):
    if y_test1[o] == 1:
        del y_test1[o]
        X_test1 = np.delete(X_test1, o, axis=0)
but I get this error:
Traceback (most recent call last):
File "<ipython-input-4-5ed18876a8b5>", line 2, in <module>
if y_test1[o]== 1:
IndexError: list index out of range
X_test1 contains the tweets and has shape 1102 x 564; y_test1 contains zeros and ones (whether the tweet is positive or negative) and has length 1102. The error appears at the 774th iteration, when the length of y_test1 has decreased from 1102 to 774.
I also tried doing it like this:
a = 1
for o in range(0, finrow):
    if (y_test1[o] == 1 and o <= finrow - a):
        del y_test1[o]
        a = a + 1
        X_test1 = np.delete(X_test1, o, axis=0)
but I still get the same error. I also don't know whether this is the best way to delete the rows of the matrix and the elements of the list: when I check the values of y_test1, a few of the elements that were supposed to be deleted are still there (not all of them, as in the beginning).
I'm kind of new at this, and I have no idea where my mistake is.
You might want to have a look at the function classification_report in scikit-learn.
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
It's the easiest way to compute Precision/Recall and F1 for each class.
You just need to pass two arrays, the first with the true labels and the second with the predictions from your classifier, e.g.:
from sklearn.metrics import classification_report
predictions = your_clf.predict(X_test1)
print(classification_report(y_test1, predictions))
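As for the filtering itself: deleting elements while looping over a fixed range shifts the indices of everything that remains, which is why the loop runs past the end of the shortened list and also skips some elements. A minimal sketch of an alternative is to select rows with a boolean mask. It assumes y_test1 can be converted to a NumPy array, and the toy arrays below are made-up stand-ins for your real X_test1, y_test1 and classifier output:

```python
import numpy as np
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Toy stand-ins for the question's data (hypothetical values, not real tweets).
X_test1 = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])
y_test1 = [1, 0, 1, 0, 1, 0]  # a plain list, as in the question
# Hypothetical classifier output; in practice: your_clf.predict(X_test1)
predictions = np.array([1, 0, 0, 0, 1, 1])

# Convert the labels to an array so they can be compared element-wise.
y_test1 = np.asarray(y_test1)

# Boolean mask: True where the label is 1. Nothing is deleted in place,
# so no index ever shifts.
keep = y_test1 == 1
X_kept = X_test1[keep]
y_kept = y_test1[keep]
pred_kept = predictions[keep]

# Per-class precision, recall, F-score and support on the full test set.
# With labels=[0, 1], index 0 of each array is class 0 and index 1 is class 1.
precision, recall, fscore, support = precision_recall_fscore_support(
    y_test1, predictions, labels=[0, 1]
)

print(classification_report(y_test1, predictions))
```

To keep the other class instead, use `y_test1 == 0` as the mask. Note that on a subset containing only one true class there can be no false positives for that class, so its precision is trivially 1.0 as soon as one sample is predicted as that class; the per-class rows of the report on the full test set are usually the more informative numbers.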