machine-learning nlp text-classification

Classification of Categories in Text Data

This may be an abstract question, but I keep running into this kind of problem and always struggle with it.

I crawled data (for example, news articles about Tata Steel), extracted the content, manually read the content of each link, and classified each article as Finance, Operation, Sustainability, and so on.

Then I built a tf-idf data frame to use as the features for a classifier model.

I want to train a model to classify these articles. So far, the only options I have are an SVM or logistic regression on the tf-idf features.

Is there a better approach to classifying text data? Is there a better alternative to tf-idf, given that we may lose information (the contextual meaning of a sentence) when we break the text into words and use them as features?
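To make the concern about lost context concrete, here is a minimal sketch in plain Python. The example sentences, the helper names (`tokenize`, `ngrams`, `tfidf`), and the smoothed idf formula are assumptions chosen for illustration, not the exact setup described above. It shows that two sentences with opposite meanings get identical tf-idf vectors when the features are single words, and that word bigrams recover some of the order.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would also handle punctuation.
    return text.lower().split()


def ngrams(tokens, n):
    # Contiguous word n-grams, joined into single feature strings.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs_terms):
    """Return one {term: weight} dict per document.

    Uses relative term frequency and a smoothed idf,
    ln((1 + N) / (1 + df)) + 1. This is one common variant, assumed here.
    """
    n_docs = len(docs_terms)
    df = Counter()
    for terms in docs_terms:
        df.update(set(terms))  # count each term once per document
    vectors = []
    for terms in docs_terms:
        tf = Counter(terms)
        vectors.append({
            term: (count / len(terms)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


# Hypothetical sentences: the first two use the same words in a different order.
docs = [
    "profit rose while debt fell",
    "debt rose while profit fell",
    "the plant cut emissions",
]

unigram_vecs = tfidf([tokenize(d) for d in docs])
bigram_vecs = tfidf([ngrams(tokenize(d), 2) for d in docs])

# Single-word features cannot tell the first two sentences apart.
same_unigrams = unigram_vecs[0] == unigram_vecs[1]
# Bigram features differ, because "profit rose" and "debt rose" are distinct.
same_bigrams = bigram_vecs[0] == bigram_vecs[1]

print("identical with unigrams:", same_unigrams)
print("identical with bigrams:", same_bigrams)
```

Adding n-grams to tf-idf is a cheap partial fix; it captures local word order but not longer-range context.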

Is there any algorithm that can help me improve classification on text data?

Solution

There are several commercial APIs as well as frameworks for text classification that improve on SVM/logistic regression over tf-idf. They take the semantics, context, and word order of sentences into account when classifying. Deep neural networks have been quite useful for this task, and you can research LSTM and RNN text classification if you want to build a neural net from scratch. For existing tools that are easier to get started with, look at spaCy and fastText. Both have examples of labeling and training data for classification models.
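As a rough starting point for the fastText route, the sketch below converts manually labeled articles into the one-example-per-line `__label__` format that fastText's supervised mode reads. The helper names, the sample articles, the file name, and the hyperparameter values are assumptions for illustration. The `train` function needs the third-party `fasttext` package and is not called here.

```python
import re


def to_fasttext_line(label, text):
    # fastText expects one example per line, so collapse all whitespace
    # (including newlines) and prefix the label with "__label__".
    label_token = "__label__" + re.sub(r"\s+", "_", label.strip())
    clean_text = " ".join(text.split())
    return f"{label_token} {clean_text}"


def write_training_file(examples, path):
    # examples: iterable of (label, article_text) pairs.
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in examples:
            handle.write(to_fasttext_line(label, text) + "\n")


def train(path):
    # Requires `pip install fasttext`. The epoch and wordNgrams values
    # are illustrative guesses, not tuned settings.
    import fasttext
    return fasttext.train_supervised(input=path, epoch=25, wordNgrams=2)


# Hypothetical labeled articles standing in for the crawled data.
examples = [
    ("Finance", "Tata Steel  reports\nquarterly profit"),
    ("Sustainability", "The company pledges to cut emissions"),
]

lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))
```

After `write_training_file(examples, "train.txt")`, calling `train("train.txt")` would return a model whose `predict` method takes a single line of text and returns the predicted labels with their probabilities. Setting `wordNgrams=2` lets the model use word pairs as features, which addresses part of the word-order concern raised in the question.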