python, twitter, machine-learning, categorization

Twitter / general categorisation training corpus


Does anyone know of any good, broad Twitter categorisation corpora?

I am looking for broad categories such as:
- sport
- science/technology
- food
- health
- entertainment
- music
- games
- finance
- education
- politics
- television
- religion
- motor
- conflict

(I think that pretty much covers everything)

There are very nice resources linked here, but they are way too specific:

  • Reuters is specific to commodities and natural resources
  • 20Newsgroups looks like it's for American newspapers
  • Medir for cardiovascular medical data

EDIT
This is very exciting. I found this database via sklearn. Here is the list of all categories. It looks like it contains what I am looking for. I'll have to learn how to use it and then implement the thing, so I'll get back to you guys if it works...


Solution

  • Mostly success! This is not a Twitter-optimized training data set, though; it seems to be meant for general text categorisation.
    OK, this was much more awkward than hoped. First of all,

    from sklearn.datasets import fetch_rcv1
    rcv1 = fetch_rcv1()
    

    creates a dataset that I have no idea how to work with. The data are 47236-dimensional vectors instead of text tokens, and I could not find any obvious or documented way to deal with that. So I had to take the long route.

    Looking at the data source, one can download the token files. They are broken up into 5 pieces:

    lyrl2004_tokens_train.dat
    lyrl2004_tokens_test_pt0.dat
    lyrl2004_tokens_test_pt1.dat
    lyrl2004_tokens_test_pt2.dat
    lyrl2004_tokens_test_pt3.dat
    

    with one file containing all the classifications:

    rcv1-v2.topics.qrels
    

    As a useful side note, with massive files like these it helps to look at just a bit of the data to get an idea of what you are working with. On Linux, for example, head -5 rcv1-v2.topics.qrels shows the first 5 rows of the classification data.
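
    If head is not available, the same peek can be done from Python. This is only a minimal sketch: the helper name peek and its parameters are made up for illustration, and it reads just the first few lines rather than loading the whole file.

```python
from itertools import islice


def peek(sPath, nLines=5):
    """Return the first nLines lines of a text file without reading it all."""
    # errors='replace' keeps the peek from failing on odd bytes in big dumps
    with open(sPath, encoding='utf-8', errors='replace') as oFile:
        return [sLine.rstrip('\n') for sLine in islice(oFile, nLines)]
```

    Calling peek('rcv1-v2.topics.qrels') would then return the same rows that head -5 prints, as a list of strings.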

    These files can be linked via an id. So, I created a dictionary containing all ids with their corresponding text tokens and categorisations. I did this with a dictionary, which is quite a slow process, instead of just creating two lists containing all the tokens and categories, because I have no idea if the data files match up 100%.

    My dictionary looks something like this:
    dTrainingData = {'2286': {'lsTokens': [...], 'lsCats': [...]}}
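
    Building that dictionary could look roughly like the sketch below. It rests on assumptions about the file layout that you should check with head first: token files made of records with a ".I <id>" line, a ".W" line, token lines and a blank separator, and qrels lines of the form "<category> <id> 1". The function names are made up for illustration, and the sample lines are invented stand-ins for the real files.

```python
from collections import defaultdict


def parse_tokens(lines):
    """Map document id -> list of tokens from '.I <id>' / '.W' records."""
    dTokens = {}
    sCurrentId = None
    for sLine in lines:
        sLine = sLine.strip()
        if sLine.startswith('.I '):
            sCurrentId = sLine.split()[1]
            dTokens[sCurrentId] = []
        elif sLine == '.W' or not sLine:
            continue  # record marker or blank separator line
        elif sCurrentId is not None:
            dTokens[sCurrentId].extend(sLine.split())
    return dTokens


def parse_qrels(lines):
    """Map document id -> list of categories from '<cat> <id> 1' lines."""
    dCats = defaultdict(list)
    for sLine in lines:
        lParts = sLine.split()
        if len(lParts) >= 2:
            dCats[lParts[1]].append(lParts[0])
    return dCats


def build_training_data(token_lines, qrels_lines):
    """Join both sources on id, keeping only ids present in both."""
    dTokens = parse_tokens(token_lines)
    dCats = parse_qrels(qrels_lines)
    return {
        sId: {'lsTokens': dTokens[sId], 'lsCats': dCats[sId]}
        for sId in dTokens
        if sId in dCats
    }


# Tiny in-memory stand-ins for the real files (made-up content).
lTokenLines = ['.I 2286', '.W', 'recov recov excit', 'market', '',
               '.I 2287', '.W', 'uruguay compan', '']
lQrelLines = ['E11 2286 1', 'ECAT 2286 1', 'C24 9999 1']

dTrainingData = build_training_data(lTokenLines, lQrelLines)
print(dTrainingData)
```

    Because the join is done on the id, a document without categories (2287 here) or a category row without tokens (9999 here) is simply dropped, which is the safety net I wanted from the dictionary. With the real data, open file objects can be passed in place of the lists.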

    Then I create two arrays, one for the tokens and one for the categories. The categories need to be binarised first. You can then train the model like this:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    def categorize(sText):
        # CountVectorizer expects one string per document, so join each token list
        aTokens = np.array([' '.join(d['lsTokens']) for d in dTrainingData.values()], str)
        lCats = [d['lsCats'] for d in dTrainingData.values()]

        print("creating binary cats")
        # One 0/1 column per category, since a document can have several categories
        oBinarizer = MultiLabelBinarizer()
        aBinaryCats = oBinarizer.fit_transform(lCats)

        oClassifier = Pipeline([
            ('vectorizer', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
            ('clf', OneVsRestClassifier(LinearSVC()))])

        print("fitting data to classifier...")
        oClassifier.fit(aTokens, aBinaryCats)

        aText = np.array([sText])

        aPredicted = oClassifier.predict(aText)
        # inverse_transform gives one tuple of category names per input text
        lAllCats = oBinarizer.inverse_transform(aPredicted)
        return lAllCats
    

    The results are mixed. If you look at the list of categories, you will note that a lot of them are financial, rather than the nice even spread I want, so I do get a lot of misses. However, it creates a solid foundation, and using the scaffold outlined above, it looks easy to add tokens and categories to the dTrainingData dictionary for more specific categories.
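
    As a rough sketch of that last idea, the snippet below adds a few hand-made entries in the same shape as dTrainingData and trains the pipeline once, so it can be reused for many texts instead of being refitted on every call. The ids, tokens and category names are invented for illustration, and the helper names train and categorize_many are hypothetical. In practice these entries would be merged into the RCV1-derived dictionary rather than used on their own.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Made-up entries in the same shape as dTrainingData (hypothetical ids and labels).
dTrainingData = {
    'custom-1': {'lsTokens': ['goal', 'match', 'striker', 'league'], 'lsCats': ['sport']},
    'custom-2': {'lsTokens': ['match', 'referee', 'penalty', 'goal'], 'lsCats': ['sport']},
    'custom-3': {'lsTokens': ['guitar', 'album', 'concert', 'band'], 'lsCats': ['music']},
    'custom-4': {'lsTokens': ['band', 'singer', 'album', 'tour'], 'lsCats': ['music']},
}


def train(dData):
    """Fit the binarizer and the pipeline once; return both for reuse."""
    lTexts = [' '.join(d['lsTokens']) for d in dData.values()]
    lCats = [d['lsCats'] for d in dData.values()]
    oBinarizer = MultiLabelBinarizer()
    aBinaryCats = oBinarizer.fit_transform(lCats)
    oClassifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC())),
    ])
    oClassifier.fit(lTexts, aBinaryCats)
    return oClassifier, oBinarizer


def categorize_many(oClassifier, oBinarizer, lTexts):
    """Return one tuple of predicted categories per input text."""
    return oBinarizer.inverse_transform(oClassifier.predict(lTexts))


oClassifier, oBinarizer = train(dTrainingData)
lPredicted = categorize_many(oClassifier, oBinarizer, ['album band', 'goal match'])
print(lPredicted)
```

    Note that a text can come back with an empty tuple when no category scores above the decision threshold, which is one way the misses show up.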