Tags: python, classification, nltk, svm, naive-bayes

How to train a large dataset for classification


I have a training dataset of 1,600,000 tweets. How can I train on such a huge amount of data?

I have tried using nltk.NaiveBayesClassifier, but it would take more than 5 days to train if I let it run.

def extract_features(tweet):
    # Binary bag-of-words: one boolean feature per word in the
    # global featureList (the vocabulary of candidate words)
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features


training_set = nltk.classify.util.apply_features(extract_features, tweets)

NBClassifier = nltk.NaiveBayesClassifier.train(training_set)  # This takes lots of time  

What should I do?

I need to classify my dataset using SVM and Naive Bayes.

Dataset I want to use: Link

Sample(training Dataset):

Label     Tweet
0         url aww bummer you shoulda got david carr third day
4         thankyou for your reply are you coming england again anytime soon

Sample(testing Dataset):

Label     Tweet
4         love lebron url
0         lebron beast but still cheering the til the end
I have to predict label 0 or 4 only.

How can I train this huge dataset efficiently?


Solution

  • Following what was proposed above about feature extraction, you can use the TfidfVectorizer from scikit-learn to extract the important words from the tweets. Using the default configuration, coupled with a simple LogisticRegression, it gives me 0.8 accuracy. Hope that helps. Here is an example of how to use it for your problem:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix

    train_df_raw = pd.read_csv('train.csv', header=None, names=['label', 'tweet'])
    test_df_raw = pd.read_csv('test.csv', header=None, names=['label', 'tweet'])
    train_df_raw = train_df_raw[train_df_raw['tweet'].notnull()]
    test_df_raw = test_df_raw[test_df_raw['tweet'].notnull()]
    test_df_raw = test_df_raw[test_df_raw['label'] != 2]

    # Collapse the labels to binary: 0 stays 0, 4 becomes 1
    y_train = [x if x == 0 else 1 for x in train_df_raw['label'].tolist()]
    y_test = [x if x == 0 else 1 for x in test_df_raw['label'].tolist()]
    X_train = train_df_raw['tweet'].tolist()
    X_test = test_df_raw['tweet'].tolist()

    print('At vectorizer')
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_train)
    print('At vectorizer for test data')
    X_test = vectorizer.transform(X_test)

    print('At classifier')
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)

    predictions = classifier.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predictions))

    # Use a name other than confusion_matrix so the sklearn
    # function of the same name is not shadowed
    cm = confusion_matrix(y_test, predictions)
    print(cm)

Output:

    Accuracy: 0.8
    [[135  42]
     [ 30 153]]
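
  • The question asks for SVM and Naive Bayes specifically. With the sparse TF-IDF matrices already computed above, both are drop-in replacements for LogisticRegression in scikit-learn. A minimal sketch, reusing the X_train/X_test/y_train/y_test from the example (the exact accuracies will depend on your data):

        from sklearn.svm import LinearSVC
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.metrics import accuracy_score

        # Linear SVM; fast on high-dimensional sparse TF-IDF features
        svm = LinearSVC()
        svm.fit(X_train, y_train)
        print('SVM accuracy:', accuracy_score(y_test, svm.predict(X_test)))

        # Multinomial Naive Bayes; suited to non-negative count/TF-IDF features
        nb = MultinomialNB()
        nb.fit(X_train, y_train)
        print('Naive Bayes accuracy:', accuracy_score(y_test, nb.predict(X_test)))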
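
  • If the full 1,600,000-tweet file does not fit comfortably in memory in one pass, scikit-learn also supports out-of-core training. A sketch, assuming the same train.csv layout as above: HashingVectorizer is stateless, so each chunk can be transformed independently, and SGDClassifier with its default hinge loss is a linear SVM fitted incrementally via partial_fit:

        import pandas as pd
        from sklearn.feature_extraction.text import HashingVectorizer
        from sklearn.linear_model import SGDClassifier

        vectorizer = HashingVectorizer(n_features=2**20)  # stateless, no fit needed
        classifier = SGDClassifier()  # hinge loss => linear SVM trained by SGD

        reader = pd.read_csv('train.csv', header=None, names=['label', 'tweet'],
                             chunksize=10000)
        for chunk in reader:
            chunk = chunk[chunk['tweet'].notnull()]
            y_chunk = [x if x == 0 else 1 for x in chunk['label'].tolist()]
            X_chunk = vectorizer.transform(chunk['tweet'].tolist())
            # the full set of classes must be declared on the first call
            classifier.partial_fit(X_chunk, y_chunk, classes=[0, 1])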