scikit-learn: classifying texts using custom labels

I have a large training set of words labeled pos and neg to classify texts. I used TextBlob (according to this tutorial) to classify texts. While it works fairly well, it can be very slow for a large training set (e.g. 8k words).

I would like to try doing this with scikit-learn but I'm not sure where to start. What would the above tutorial look like in scikit-learn? I'd also like the training set to include weights for certain words. Some that should pretty much guarantee that a particular text is classed as "positive" while others guarantee that it's classed as "negative". And lastly, is there a way to imply that certain parts of the analyzed text are more valuable than others?

Any pointers to existing tutorials or docs appreciated!

Solution

There is an excellent chapter on this topic in Sebastian Raschka's Python Machine Learning book and the code can be found here: https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch08/ch08.ipynb.

He does sentiment analysis (what you are trying to do) on an IMDB dataset. His data is not as clean as yours - from the looks of it - so he needs to do a bit more pre-processing work. Your problem can be solved with these steps:

Create numerical features by vectorizing your text: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
Train test split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Train and test your favourite model, e.g.: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html