python machine-learning twitter classification

Keyword-Based Classification of Tweets

I have a data set of around 40,000 Tweets. I also have 5 text files, each corresponding to a category I would like to classify the Tweets into (travel, work, vacation, etc.). Each of these text files contains keywords specific to its category.

For example, the text file for vacation (vacation.txt) contains flight, beach, hotel, etc.

I'd like to label my data set by mapping the keywords contained in these text files to the associated category.

For example, a Tweet containing the word "beach" would be labelled vacation.

I am using Python for all of my analysis. The Tweets are contained in a .csv file.

Also, what are some other interesting approaches I could take for labeling and classifying my data? I understand that keyword-based labeling is not the most efficient or accurate approach.

Solution

There are several ways to approach this.

If you rely on keyword search alone to label the data, I don't think that is the best approach.

1. Keyword approach. Count the number of keyword matches per category and assign labels accordingly. If you then train a model on these labels, you will have to work on feature selection so that the model isn't biased toward the keywords themselves. After mapping keywords to labels, it helps to build a word cloud per label and check that the keywords are not the only terms coming out on top. For features you can use TF-IDF or a count vectorizer, and later embeddings such as GloVe or fastText, or possibly BERT.
2. Clustering approach. Set your keywords and labels aside, create as many clusters as you have labels, then visualize the clusters and check how well they overlap with the labels assigned in the first approach.
3. Active learning. This is a bit more complex: you assign labels to a small subset, let the system learn from those sparse labels, derive clusters, and refine based on your feedback. It is essentially a human-in-the-loop concept.
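
As a rough illustration of the keyword approach, here is a minimal sketch using only the standard library. It assumes each keyword file holds one keyword per line (the question does not say how the files are laid out), and the "work" keywords and the sample Tweets are made up for the example. The function names are hypothetical helpers, not part of any library.

```python
import re


def load_keywords(path):
    """Read keywords from a text file, assuming one keyword per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def build_patterns(keywords_by_category):
    """Compile one whole-word, case-insensitive regex per category."""
    patterns = {}
    for category, keywords in keywords_by_category.items():
        cleaned = [k.strip() for k in keywords if k.strip()]
        if not cleaned:
            continue  # skip categories without usable keywords
        alternation = "|".join(re.escape(k) for k in cleaned)
        patterns[category] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def label_tweet(text, patterns, default="unlabelled"):
    """Return the category with the most keyword hits, or `default` if none match.

    Ties go to the category that appears first in `patterns`.
    """
    best_category, best_hits = default, 0
    for category, pattern in patterns.items():
        hits = len(pattern.findall(text))
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category


# In practice this dict would be built with load_keywords("vacation.txt") etc.
keywords_by_category = {
    "vacation": ["flight", "beach", "hotel"],    # from the question
    "work": ["meeting", "deadline", "office"],   # hypothetical keywords
}
patterns = build_patterns(keywords_by_category)

tweets = [
    "Booked a hotel right on the beach!",
    "Another deadline, another late night at the office",
    "Just had lunch",
]
labels = [label_tweet(t, patterns) for t in tweets]
print(labels)  # ['vacation', 'work', 'unlabelled']
```

Note that whole-word matching means "beaches" will not match "beach"; stemming or lemmatizing the Tweets first would be one way around that. Tweets left as "unlabelled" are natural candidates for the clustering or active learning steps.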

Let me know if you want a more detailed answer on any of the above, or on other approaches.