python twitter dataset text-classification

What is the most efficient way to extract tweets written in a certain dialect?


I'm doing text classification for Arabic dialects and need to collect data, so I'm using the Twitter API to do that.

However, the problem is:

I need to find tweets that are all written in the same dialect.

One solution I have is:

Collect tweets based on certain keywords that only one dialect uses.

One problem with that solution is:

When I test the model, the accuracy will of course be high, because the test data will contain the same keywords that I used to collect the dataset.
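To make the bias concrete, here is a minimal sketch in plain Python. The seed keywords and the four sample tweets are made up for illustration (they are assumptions, not real collected data), and `keyword_rule`, `strip_keywords`, and `accuracy` are hypothetical helper names. It shows that a trivial rule that only checks for the seed keywords scores perfectly on a test set collected with those same keywords, and that the score collapses once the keywords are removed from the text.

```python
# Illustrative seed keywords for the target dialect (an assumption for this sketch).
SEED = {"عايز", "دلوقتي"}


def keyword_rule(text, keywords=SEED):
    """Predict 'target' if any seed keyword appears as a token, else 'other'."""
    return "target" if set(text.split()) & keywords else "other"


def strip_keywords(text, keywords=SEED):
    """Return the text with every seed keyword token removed."""
    return " ".join(tok for tok in text.split() if tok not in keywords)


def accuracy(samples, predict):
    """Fraction of (text, label) pairs for which predict(text) equals label."""
    correct = sum(predict(text) == label for text, label in samples)
    return correct / len(samples)


# Made-up test set gathered the same way as the training data:
# every 'target' tweet was found through one of the seed keywords.
test_set = [
    ("انا عايز قهوة", "target"),
    ("هروح دلوقتي", "target"),
    ("أريد قهوة", "other"),
    ("سأذهب الآن", "other"),
]

# Score with the keywords present, then with the keywords masked out.
biased_score = accuracy(test_set, keyword_rule)
masked_score = accuracy(
    [(strip_keywords(text), label) for text, label in test_set], keyword_rule
)
print(biased_score, masked_score)
```

The gap between the two scores is a rough indication of how much of the measured accuracy comes from the collection keywords rather than from the dialect itself.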

What I'm looking for:

Is there another way to collect the data that circumvents this bias?


Solution

  • Note that this is a platform for getting advice on specific code, not for discussing methodologies.

    That said, you could manually collect data from this particular dialect, collect other tweets as well, and then build a classifier that predicts which group a tweet belongs to.
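As a rough sketch of that last step, the snippet below trains a tiny two-group classifier in plain Python. The answer does not name a model, so the choice of a character-trigram naive Bayes with add-one smoothing is an assumption made here, as are the class name `NgramNaiveBayes`, the labels `target` and `other`, and the handful of made-up example sentences standing in for manually labelled tweets.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character trigrams (illustrative only)."""

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram frequencies
        self.doc_counts = Counter(labels)         # label -> number of tweets
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up stand-ins for manually labelled tweets (assumption for this sketch).
train_texts = [
    "انا عايز اروح دلوقتي",
    "ازيك عامل ايه",
    "أريد أن أذهب الآن",
    "كيف حالك اليوم",
]
train_labels = ["target", "target", "other", "other"]

model = NgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("عايز قهوة دلوقتي"))
print(model.predict("كيف حالك"))
```

Because the labels come from manual annotation rather than from a keyword search, the test split is not guaranteed to contain any particular word, which is what makes the measured accuracy more trustworthy. A real dataset would need far more examples than this toy sample.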