I have a big dataset containing almost 0.5 billion tweets. I'm doing some research on how firms engage in activism, and so far I have labelled the tweets that fall into an activism category based on the presence of certain hashtags in the tweets.
Now, suppose firms tweet about an activism topic without including any hashtag in the tweet. My code won't categorize it, so my idea was to run an SVM classifier with only one class.
This leads to the following question:
Thanks in advance for your help!
You have described the setup of a class of problems called "Positive-Unlabelled Learning" (PUL). The name comes from the fact that you have two types of data: positive (the "activism" label) and unlabelled (maybe "activism", maybe not). Your idea of using an SVM is a common approach, as is using random forests. As in most ML problems, however, neural nets are becoming more common.
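To make the positive/unlabelled idea concrete, here is a minimal sketch of one classic PUL recipe (the Elkan and Noto approach), written with scikit-learn rather than an SVM. It rests on assumptions that are mine, not from your setup: the labelled positives are a random sample of all positives (which may not hold exactly for hashtag-based labels), the features are already numeric, and the helper names `fit_pu_classifier` and `predict_positive_proba` are hypothetical. The data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_pu_classifier(X, s, random_state=0):
    """Fit a 'labelled vs. unlabelled' model and estimate the label frequency.

    s[i] is 1 if example i is a labelled positive, 0 if it is unlabelled.
    (Hypothetical helper, for illustration only.)
    """
    X_train, X_hold, s_train, s_hold = train_test_split(
        X, s, test_size=0.2, stratify=s, random_state=random_state
    )
    # Step 1: an ordinary classifier that predicts P(s = 1 | x).
    g = LogisticRegression(max_iter=1000).fit(X_train, s_train)
    # Step 2: estimate c = P(s = 1 | y = 1) as the mean score on
    # held-out labelled positives.
    c = g.predict_proba(X_hold[s_hold == 1])[:, 1].mean()
    return g, c


def predict_positive_proba(g, c, X):
    """Rescale P(s = 1 | x) by 1 / c to approximate P(y = 1 | x)."""
    return np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)


# Synthetic stand-in for tweet features: y is the (normally unknown) truth.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=0
)
rng = np.random.default_rng(0)
# Only about 30% of the true positives carry a label; everything else,
# positive or negative, is unlabelled.
s = ((y == 1) & (rng.random(len(y)) < 0.3)).astype(int)

g, c = fit_pu_classifier(X, s)
p = predict_positive_proba(g, c, X)
print("estimated label frequency c:", round(float(c), 3))
```

For tweets, `X` would be something like a TF-IDF matrix and `s` your hashtag-based labels; the rescaled scores can then be thresholded to flag likely activism tweets that carry no hashtag.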
pywsl is a "weak supervision" library that includes some PUL implementations (PUL is a type of weak supervision). Here is an example of using it on some synthetic data:
import numpy as np
from sklearn.model_selection import GridSearchCV

from pywsl.pul import pumil_mr
from pywsl.utils.syndata import gen_twonorm_pumil
from pywsl.utils.comcalc import bin_clf_err


def main():
    # Class prior passed to the data generator and the estimator
    prior = .5

    # Synthetic data: 30 positive and 200 unlabelled training samples,
    # plus 100 test samples
    x, y, x_t, y_t = gen_twonorm_pumil(n_p=30, n_u=200,
                                       prior_u=prior, n_t=100)

    # Hyperparameters to search over with 5-fold cross-validation
    param_grid = {'prior': [prior],
                  'lam': np.logspace(-3, 1, 5),
                  'basis': ['minimax']}
    clf = GridSearchCV(estimator=pumil_mr.PUMIL_SL(),
                       param_grid=param_grid,
                       cv=5, n_jobs=-1)
    clf.fit(x, y)

    # Predict on the test set and report the error as a percentage
    y_h = clf.predict(x_t)
    err = 100*bin_clf_err(y_h, y_t, prior)
    print("MR: {}%".format(err))


if __name__ == "__main__":
    main()
Also, see this possible duplicate question: "Binary semi-supervised classification with positive only and unlabeled data set".