How to remove irrelevant text data from a large dataset

I am working on an ML project whose data come from a social media platform, and the topic of the data is supposed to be depression during Covid-19. However, when I read some of the retrieved data, I noticed that some of the texts (around 1-5%) mention covid-related keywords but are not actually about the pandemic: they tell a life story (from age 5 to age 27) rather than describing how covid affects the author's life. The data I am looking for are texts in which people describe how covid makes their depression worse, and so on. Is there a general way to remove the irrelevant texts whose context is not covid-related (or outliers)? Or is it OK to keep them in the dataset, since they account for only 1-5%?

Solution

I think what you want is topic modeling, or perhaps a TextRank algorithm, or something along those lines. Check out the link below for some ideas of where to go with this.

https://monkeylearn.com/keyword-extraction/

Bag of words simply refers to a matrix in which the rows are documents and the columns are words. The value for a given document and word can be a count of the word's occurrences in that document or a tf-idf weight. The matrix is then provided to a machine learning algorithm. The bag of words model has several weaknesses when applied to natural language processing tasks, and graph ranking algorithms such as TextRank are able to address some of them. In particular, word counts or tf-idf only identify key single-word terms in a document, whereas TextRank is able to incorporate word sequence information.
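To make the contrast concrete, here is a minimal sketch in plain Python, not a definitive implementation. The three posts are made up for illustration, and the function names (tokenize, tfidf_matrix, textrank_keywords) are hypothetical. The tf-idf weight used is raw count times log(N / df), which is only one of several common variants, and the TextRank part is a simplified version that skips stopword and part-of-speech filtering and runs a fixed number of iterations.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy posts, made up for illustration only.
docs = [
    "covid lockdown made my depression worse",
    "lost my job during covid lockdown and felt depressed",
    "when i was five my family moved and i felt alone",
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_matrix(docs):
    """Bag of words: one row per document, one column per vocabulary word."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # Assumed weighting: raw term count times log(N / df).
        rows.append([counts[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, rows


def textrank_keywords(text, window=2, damping=0.85, iterations=50):
    """Rank words with PageRank over a word co-occurrence graph."""
    tokens = tokenize(text)
    neighbors = defaultdict(set)
    # Link each word to the words that follow it within the window,
    # which is where word order enters the model.
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                neighbors[w].add(tokens[j])
                neighbors[tokens[j]].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)


vocab, matrix = tfidf_matrix(docs)
print(vocab)                            # column labels
print(matrix[0])                        # tf-idf row for the first post
print(textrank_keywords(docs[1])[:3])   # top-ranked words in the second post
```

Note that a word appearing in every post (here "my") gets a tf-idf weight of zero in this variant, while the TextRank scores depend on which words occur near each other.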

Also, see the link below.

https://towardsdatascience.com/topic-modeling-quora-questions-with-lda-nmf-aff8dce5e1dd

The sample data used in the example at that link is available directly below.

https://raw.githubusercontent.com/susanli2016/NLP-with-Python/master/data/quora_sample.csv