I have a training dataset of 1,600,000 tweets. How can I train on data this large?
I have tried nltk.NaiveBayesClassifier, but it would take more than 5 days to train if I ran it.
def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
training_set = nltk.classify.util.apply_features(extract_features, tweets)
NBClassifier = nltk.NaiveBayesClassifier.train(training_set) # This takes lots of time
What should I do?
I need to classify my dataset using SVM and Naive Bayes.
Dataset I want to use: Link
Sample (training dataset):
Label Tweet
0 url aww bummer you shoulda got david carr third day
4 thankyou for your reply are you coming england again anytime soon
Sample (testing dataset):
Label Tweet
4 love lebron url
0 lebron beast but still cheering the til the end
I only have to predict the label, which is either 0 or 4.
How can I train on this huge dataset efficiently?
Following what superbly proposed about feature extraction, you could use the TfidfVectorizer from the scikit-learn library to extract the important words from the tweets. Using the default configuration, coupled with a simple LogisticRegression, it gives me 0.8 accuracy. Hope that helps. Here is an example of how to use it for your problem:
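import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the data; the files have no header row
train_df_raw = pd.read_csv('train.csv', header=None, names=['label', 'tweet'])
test_df_raw = pd.read_csv('test.csv', header=None, names=['label', 'tweet'])
# Drop rows with a missing tweet
train_df_raw = train_df_raw[train_df_raw['tweet'].notnull()]
test_df_raw = test_df_raw[test_df_raw['tweet'].notnull()]
# Drop test rows labelled 2, since only labels 0 and 4 are predicted
test_df_raw = test_df_raw[test_df_raw['label'] != 2]
# Map labels to binary: 0 stays 0, anything else (4) becomes 1
y_train = [x if x == 0 else 1 for x in train_df_raw['label'].tolist()]
y_test = [x if x == 0 else 1 for x in test_df_raw['label'].tolist()]
X_train = train_df_raw['tweet'].tolist()
X_test = test_df_raw['tweet'].tolist()
print('At vectorizer')
vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights on the training tweets only
X_train = vectorizer.fit_transform(X_train)
print('At vectorizer for test data')
X_test = vectorizer.transform(X_test)
print('at Classifier')
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
# Use a separate name so the imported confusion_matrix function is not shadowed
cm = confusion_matrix(y_test, predictions)
print(cm)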
Accuracy: 0.8
[[135 42]
[ 30 153]]
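Since the question asks for Naive Bayes and SVM specifically, the same TF-IDF features can probably be fed to scikit-learn's MultinomialNB and LinearSVC in place of LogisticRegression. The following is only a minimal sketch of that swap, not something measured on the full 1.6M tweets. The toy tweets, their labels, and the build_models helper are made up for illustration; in practice you would pass in the X_train / y_train lists built from the CSV files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus standing in for the real tweets
# (1 = positive, 0 = negative). Replace with the real data.
train_tweets = [
    "love this sunny day",
    "thankyou for your reply",
    "great game tonight love it",
    "happy to see you again",
    "aww bummer missed the bus",
    "sad and tired today",
    "hate this rainy weather",
    "bummer my phone broke",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_models():
    """Hypothetical helper: map a name to a TF-IDF + classifier pipeline."""
    return {
        # Multinomial Naive Bayes works directly on sparse TF-IDF features
        "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
        # LinearSVC is a linear-kernel SVM suited to sparse, high-dimensional text
        "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
    }


models = build_models()
for name, model in models.items():
    # Each pipeline fits its own vectorizer, then the classifier
    model.fit(train_tweets, train_labels)
    print(name, model.predict(["love you", "bummer so sad"]))
```

Both classifiers accept the sparse matrix produced by TfidfVectorizer, so nothing has to be converted to a dense array or to per-tweet feature dictionaries, which is likely a large part of why this route is so much faster than the NLTK approach in the question.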