python twitter nlp classification text-classification

Find how similar a text is - One Class Classifier (NLP)


I have a big dataset containing almost 0.5 billion tweets. I'm researching how firms engage in activism and, so far, I have labelled tweets as belonging to an activism category based on the presence of certain hashtags in the tweet.

Now, let's suppose firms are tweeting about an activism topic without including any hashtag in the tweet. My code won't categorize such a tweet, so my idea was to run an SVM classifier with only one class.

This leads to the following questions:

  • Is this solution feasible from a data science point of view?
  • Are there any other one-class classifiers?
  • (Most important of all) Are there any other ways to find out whether a tweet is similar to the set of tweets containing activism hashtags?

Thanks in advance for your help!


Solution

  • You have described the setup of a class of problems called "Positive-Unlabelled Learning" (PUL). The name comes from the fact that you have two types of data: positive ("activism" label) and unlabelled (maybe "activism", maybe not). Your idea of using an SVM is a common choice, as are random forests. As elsewhere in ML, though, neural nets are becoming more common here too.

    pywsl is a "weak supervision" library which includes some PUL implementations (PUL is a type of weak supervision). Here is an example of using it on some synthetic data:

    import numpy as np

    from sklearn.model_selection import GridSearchCV

    from pywsl.pul import pumil_mr
    from pywsl.utils.syndata import gen_twonorm_pumil
    from pywsl.utils.comcalc import bin_clf_err


    def main():
        # Class prior passed to both the data generator and the estimator.
        prior = .5
        # Generate synthetic positive/unlabelled training data and a test set.
        x, y, x_t, y_t = gen_twonorm_pumil(n_p=30, n_u=200,
                                           prior_u=prior, n_t=100)
        # Grid of hyperparameters; only the regularisation strength varies.
        param_grid = {'prior': [prior],
                      'lam': np.logspace(-3, 1, 5),
                      'basis': ['minimax']}
        # Select 'lam' by 5-fold cross-validation, using all CPU cores.
        clf = GridSearchCV(estimator=pumil_mr.PUMIL_SL(),
                           param_grid=param_grid,
                           cv=5, n_jobs=-1)
        clf.fit(x, y)
        # Predict on the test set and report the error as a percentage.
        y_h = clf.predict(x_t)
        err = 100*bin_clf_err(y_h, y_t, prior)
        print("MR: {}%".format(err))


    if __name__ == "__main__":
        main()
    

    Also, see this possible duplicate question, Binary semi-supervised classification with positive only and unlabeled data set
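
For the one-class SVM idea and the "how similar is this tweet to the activism set" question, here is a minimal sketch using scikit-learn. It is only an illustration under stated assumptions: the example tweets are made up, the TF-IDF features and the `nu=0.1` setting are arbitrary choices rather than recommendations, and any decision threshold on the similarity score would have to be tuned on your own data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import OneClassSVM

# Hypothetical toy data: tweets already labelled via activism hashtags.
activism_tweets = [
    "we stand with climate strikers #climateaction",
    "our company supports equal pay for women #equalpay",
    "proud to march for climate justice today #climateaction",
    "equal rights and equal pay for all workers #equalpay",
]
# Hypothetical unlabelled tweets that carry no hashtag.
unlabelled_tweets = [
    "we support climate justice and equal pay",
    "new burger menu available this friday",
]

# Learn the TF-IDF vocabulary from the positive (activism) tweets only.
vectorizer = TfidfVectorizer(stop_words="english")
X_pos = vectorizer.fit_transform(activism_tweets)
X_new = vectorizer.transform(unlabelled_tweets)

# Option 1: cosine similarity between each new tweet and the centroid
# (mean TF-IDF vector) of the activism tweets.
centroid = np.asarray(X_pos.mean(axis=0))  # shape (1, n_features)
similarity = cosine_similarity(X_new, centroid).ravel()

# Option 2: one-class SVM fitted on the activism tweets only.
# predict() returns +1 for inliers and -1 for outliers.
ocsvm = OneClassSVM(kernel="linear", nu=0.1).fit(X_pos)
ocsvm_labels = ocsvm.predict(X_new)

for tweet, sim, label in zip(unlabelled_tweets, similarity, ocsvm_labels):
    print(f"{sim:.3f}  {label:+d}  {tweet}")
```

The similarity score gives a ranking rather than a hard label, which can be convenient when you want to inspect the top candidates by hand before trusting a classifier. Neither option uses the unlabelled tweets during training, which is the main difference from the PU-learning approach above.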