Does anyone know of any good broad Twitter categorisation corpora?
I am looking for broad categories such as:
- sport
- science/technology
- food
- health
- entertainment
- music
- games
- finance
- education
- politics
- television
- religion
- motor
- conflict
(I think that pretty much covers everything)
There are some very nice resources linked here, but they are way too specific.
EDIT
This is very exciting. I found the RCV1 (Reuters) dataset via sklearn. Here is the list of all the categories, and it looks like it contains what I am looking for.
I'll have to learn how to use it and then implement the thing, so I'll have to get back to you guys if it works...
Mostly a success! Note that this is not a Twitter-optimised training data set, though; it seems to be meant for general text categorisation.
OK, this was much more awkward than I hoped. First of all,
from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()
returns a dataset that I had no idea how to work with: the data are 47,236-dimensional TF-IDF feature vectors instead of text tokens, with no obvious or documented (that I could find) way to get back to the tokens. So I had to take the long route.
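For reference, here is a quick inspection of what fetch_rcv1 actually hands you (the attribute names and shapes below are from the sklearn documentation for the full corpus):

rcv1 = fetch_rcv1()
print(rcv1.data.shape)        # (804414, 47236) sparse TF-IDF feature matrix
print(rcv1.target.shape)      # (804414, 103) binary category memberships
print(rcv1.target_names[:3])  # topic codes such as 'C11'
print(rcv1.sample_id[:3])     # document ids, matching the files below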
Looking at the data source, one can download the token files. They are broken up into 5 pieces:
lyrl2004_tokens_train.dat,
lyrl2004_tokens_test_pt0.dat,
lyrl2004_tokens_test_pt1.dat,
lyrl2004_tokens_test_pt2.dat,
lyrl2004_tokens_test_pt3.dat,
with one file containing all the classifications:
rcv1-v2.topics.qrels
As a side note: for massive files like these, it helps to look at just a bit of the data first to get an idea of what you are working with. On Linux, you can run head -5 rcv1-v2.topics.qrels
to look at the top 5 rows of the classification data, for example.
These files can be linked via a document id. So I created a dictionary mapping every id to its corresponding text tokens and categorisations. The reason I did this with a dictionary, which is quite a slow process, instead of just creating two parallel lists of tokens and categories, is that I have no idea whether the data files match up 100%.
My dictionary looks something like this:
dTrainingData = {'2286': {'lsTokens': [...], 'lsCats': [...]}}
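Building that dictionary from the downloaded files looks roughly like this (a sketch; it assumes the token files use the '.I <docid>' / '.W' document markers and that each qrels line has the form '<category> <docid> 1', which is what the LYRL2004 files appear to use):

dTrainingData = {}

def load_tokens(sPath):
    # Each document starts with '.I <docid>', then a '.W' line,
    # then lines of space-separated tokens, ending at a blank line.
    with open(sPath) as oFile:
        sId = None
        for sLine in oFile:
            sLine = sLine.strip()
            if sLine.startswith('.I'):
                sId = sLine.split()[1]
                dTrainingData[sId] = {'lsTokens': [], 'lsCats': []}
            elif sLine and sLine != '.W':
                dTrainingData[sId]['lsTokens'].extend(sLine.split())

for sPath in ['lyrl2004_tokens_train.dat',
              'lyrl2004_tokens_test_pt0.dat',
              'lyrl2004_tokens_test_pt1.dat',
              'lyrl2004_tokens_test_pt2.dat',
              'lyrl2004_tokens_test_pt3.dat']:
    load_tokens(sPath)

with open('rcv1-v2.topics.qrels') as oFile:
    for sLine in oFile:
        sCat, sId, _ = sLine.split()
        if sId in dTrainingData:  # guard, since the files may not match up 100%
            dTrainingData[sId]['lsCats'].append(sCat)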
Then I create two numpy arrays, one for the tokens and one for the categories. The categories need to be binarised first, since each document can carry several labels. You can then train the model and classify a new text like this:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

def categorize(sText):
    # CountVectorizer expects raw strings, so join each token list
    # back into one space-separated document.
    aTokens = np.array([' '.join(d['lsTokens']) for d in dTrainingData.values()])
    lCats = [d['lsCats'] for d in dTrainingData.values()]

    # Turn the lists of labels into a binary indicator matrix,
    # one column per category.
    print("creating binary cats")
    oBinarizer = MultiLabelBinarizer()
    aBinaryCats = oBinarizer.fit_transform(lCats)

    # Bag of words -> TF-IDF weighting -> one linear SVM per category.
    oClassifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])

    print("fitting data to classifier...")
    oClassifier.fit(aTokens, aBinaryCats)

    # Predict on the new text and map the binary row back to labels.
    aPredicted = oClassifier.predict(np.array([sText]))
    lAllCats = oBinarizer.inverse_transform(aPredicted)
    return lAllCats
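Calling it then looks something like this (the example text is made up):

lPredicted = categorize("the stock market dropped sharply today")
print(lPredicted)  # a list with one tuple of RCV1 topic codes per input text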
The results are mixed. If you look at the list of categories, you will note that a lot of them are financial rather than the nice even spread I want, so I do get a lot of misses. However, this creates a solid foundation, and using the scaffold highlighted above it looks easy to add tokens/categories to the dTrainingData dictionary for more specific categories.
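For example, bolting on a hand-labelled entry would look like this (the id, tokens, and category are made up):

dTrainingData['custom0001'] = {
    'lsTokens': ['kickoff', 'goal', 'premier', 'league'],
    'lsCats': ['sport']}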