Does anyone know of any good broad Twitter categorisation corpora?
I am looking for broad categories such as:
- sport
- science/technology
- food
- health
- entertainment
- music
- games
- finance
- education
- politics
- television
- religion
- motor
- conflict
(I think that pretty much covers everything)
There are some very nice resources linked here, but they are way too specific.
EDIT
This is very exciting. I found the RCV1 (Reuters) dataset via sklearn. Here is the list of all the categories, and it looks like it contains what I am looking for.
I'll have to learn how to use it and then implement the thing, so I'll have to get back to you guys if it works...
Mostly a success! Note that this is not a Twitter-optimised training data set, though; it seems to be meant for general text categorisation.
OK, this was much more awkward than I hoped. First of all,
from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()
returns a dataset that I had no idea how to work with: the data are 47,236-dimensional TF-IDF feature vectors instead of text tokens, with no obvious or documented (that I could find) way to get back to the tokens. So I had to take the long route.
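For reference, here is a quick inspection of what fetch_rcv1 actually hands you (the attribute names and shapes below are from the sklearn documentation for the full corpus):

rcv1 = fetch_rcv1()
print(rcv1.data.shape)        # (804414, 47236) sparse TF-IDF feature matrix
print(rcv1.target.shape)      # (804414, 103) binary category memberships
print(rcv1.target_names[:3])  # topic codes such as 'C11'
print(rcv1.sample_id[:3])     # document ids, matching the files below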
Looking at the data source, one can download the token files. They are broken up into 5 pieces:
lyrl2004_tokens_train.dat,
lyrl2004_tokens_test_pt0.dat,
lyrl2004_tokens_test_pt1.dat,
lyrl2004_tokens_test_pt2.dat,
lyrl2004_tokens_test_pt3.dat,
with one file containing all the classifications:
rcv1-v2.topics.qrels
As a side note: for massive files like these, it helps to look at just a bit of the data first to get an idea of what you are working with. On Linux, you can run head -5 rcv1-v2.topics.qrels
to look at the top 5 rows of the classification data, for example.
These files can be linked via a document id. So I created a dictionary mapping every id to its corresponding text tokens and categorisations. The reason I did this with a dictionary, which is quite a slow process, instead of just creating two parallel lists of tokens and categories, is that I have no idea whether the data files match up 100%.
My dictionary looks something like this:
dTrainingData = {'2286': {'lsTokens': [...], 'lsCats': [...]}}
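Building that dictionary from the downloaded files looks roughly like this (a sketch; it assumes the token files use the '.I <docid>' / '.W' document markers and that each qrels line has the form '<category> <docid> 1', which is what the LYRL2004 files appear to use):

dTrainingData = {}

def load_tokens(sPath):
    # Each document starts with '.I <docid>', then a '.W' line,
    # then lines of space-separated tokens, ending at a blank line.
    with open(sPath) as oFile:
        sId = None
        for sLine in oFile:
            sLine = sLine.strip()
            if sLine.startswith('.I'):
                sId = sLine.split()[1]
                dTrainingData[sId] = {'lsTokens': [], 'lsCats': []}
            elif sLine and sLine != '.W':
                dTrainingData[sId]['lsTokens'].extend(sLine.split())

for sPath in ['lyrl2004_tokens_train.dat',
              'lyrl2004_tokens_test_pt0.dat',
              'lyrl2004_tokens_test_pt1.dat',
              'lyrl2004_tokens_test_pt2.dat',
              'lyrl2004_tokens_test_pt3.dat']:
    load_tokens(sPath)

with open('rcv1-v2.topics.qrels') as oFile:
    for sLine in oFile:
        sCat, sId, _ = sLine.split()
        if sId in dTrainingData:  # guard, since the files may not match up 100%
            dTrainingData[sId]['lsCats'].append(sCat)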
Then I create two numpy arrays, one for the tokens and one for the categories. The categories need to be binarised first, since each document can carry several labels. You can then train the model and classify a new text like this:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

def categorize(sText):
    # CountVectorizer expects raw strings, so join each token list
    # back into one space-separated document.
    aTokens = np.array([' '.join(d['lsTokens']) for d in dTrainingData.values()])
    lCats = [d['lsCats'] for d in dTrainingData.values()]

    # Turn the lists of labels into a binary indicator matrix,
    # one column per category.
    print("creating binary cats")
    oBinarizer = MultiLabelBinarizer()
    aBinaryCats = oBinarizer.fit_transform(lCats)

    # Bag of words -> TF-IDF weighting -> one linear SVM per category.
    oClassifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])

    print("fitting data to classifier...")
    oClassifier.fit(aTokens, aBinaryCats)

    # Predict on the new text and map the binary row back to labels.
    aPredicted = oClassifier.predict(np.array([sText]))
    lAllCats = oBinarizer.inverse_transform(aPredicted)
    return lAllCats
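Calling it then looks something like this (the example text is made up):

lPredicted = categorize("the stock market dropped sharply today")
print(lPredicted)  # a list with one tuple of RCV1 topic codes per input text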
The results are mixed. If you look at the list of categories, you will note that a lot of them are financial rather than the nice even spread I want, so I do get a lot of misses. However, this creates a solid foundation, and using the scaffold highlighted above it looks easy to add tokens/categories to the dTrainingData dictionary for more specific categories.
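For example, bolting on a hand-labelled entry would look like this (the id, tokens, and category are made up):

dTrainingData['custom0001'] = {
    'lsTokens': ['kickoff', 'goal', 'premier', 'league'],
    'lsCats': ['sport']}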