python, classification, word2vec

Using semantic word representation (e.g. word2vec) to build a classifier


I want to build a classifier for forum posts that automatically assigns each post to one of several predefined categories (multiclass, not just binary classification) using semantic word representations. For this task I want to use word2vec and doc2vec and check the feasibility of using these models to support a fast selection of training data for the classifier. So far I have tried both models and they work like a charm. However, I do not want to manually label each sentence with the category it describes; I would like to leave that task to the word2vec or doc2vec models.

So my question is: what algorithm can I use in Python for the classifier? I was thinking of applying some clustering over the word2vec or doc2vec vectors and then manually labeling each cluster, but this would take some time and is not the best solution. Previously I used LinearSVC (an SVM) with OneVsRestClassifier, but for that I had to label each sentence by hand (building the "y_train" vector manually) in order to predict which class a new test sentence belongs to. What would be a good algorithm and method in Python for this type of classifier, one that makes use of semantic word representations for the training data?


Solution

  • The issue with things like word2vec/doc2vec, and really any unsupervised model, is that they only use context. For example, given a sentence like "Today is a hot day" and another like "Today is a cold day", the model treats "hot" and "cold" as very similar and is likely to put them in the same cluster.

    This makes it pretty bad for tagging. Either way, there is a good implementation of Doc2Vec and Word2Vec in the gensim module for Python: you can quickly load the prebuilt word2vec binary trained on the Google News dataset and test whether you get meaningful clusters.

    The other approach you could try is to set up a simple Lucene/Solr system on your computer and begin tagging a few sentences at random. Over time Lucene/Solr will suggest tags for your documents, and they turn out to be pretty decent tags if your data is not really bad.

    The issue here is that the problem you're trying to solve is neither particularly easy nor completely solvable. If you have very good, clear data, you may be able to auto-classify about 80-90% of it, but if the data is bad, you won't be able to auto-classify much.
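
A rough sketch of the "tag a few sentences and let the system suggest the rest" idea, applied to word vectors: label a handful of seed posts, average the word vectors of each post, and give a new post the label of the nearest class centroid. This is only one possible approach, not something the answer above prescribes. The tiny 3-dimensional `TOY_VECTORS` table, the category names, and the `min_similarity` threshold are all made up for illustration; in practice the vectors would presumably be looked up from a trained word2vec model (for example one loaded with gensim).

```python
import math
from collections import defaultdict

# Toy 3-dimensional "embeddings" invented for this example (assumption);
# real vectors would come from a trained word2vec model.
TOY_VECTORS = {
    "printer":  [0.9, 0.1, 0.0],
    "driver":   [0.8, 0.2, 0.1],
    "install":  [0.7, 0.3, 0.0],
    "refund":   [0.1, 0.9, 0.1],
    "invoice":  [0.0, 0.8, 0.2],
    "payment":  [0.2, 0.9, 0.0],
    "login":    [0.1, 0.1, 0.9],
    "password": [0.0, 0.2, 0.8],
}


def post_vector(text, vectors=TOY_VECTORS):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(seed_posts):
    """Build one mean vector per label from (text, label) seed pairs."""
    grouped = defaultdict(list)
    for text, label in seed_posts:
        vec = post_vector(text)
        if vec is not None:
            grouped[label].append(vec)
    return {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(text, centroids, min_similarity=0.8):
    """Return the nearest label, or None when nothing is similar enough."""
    vec = post_vector(text)
    if vec is None:
        return None
    label, score = max(
        ((lbl, cosine(vec, c)) for lbl, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if score >= min_similarity else None


# A few hand-labeled seed posts (hypothetical categories).
SEED_POSTS = [
    ("printer driver", "hardware"),
    ("refund invoice", "billing"),
    ("login password", "account"),
]
CENTROIDS = fit_centroids(SEED_POSTS)

print(predict("cannot install the printer", CENTROIDS))
print(predict("payment refund", CENTROIDS))
print(predict("hello world", CENTROIDS))
```

Posts that fall below the threshold come back as `None`, so they can be set aside for manual labeling instead of being forced into a category. Because the vectors only capture context, the "hot"/"cold" problem described above still applies: words with opposite meanings can land near the same centroid.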