I'm doing text classification for Arabic dialects and need to collect data, so I'm using the Twitter API.
I need to find tweets that are written in the same dialect.
My idea is to collect tweets based on certain keywords that only one dialect has.
When I test on this data, the accuracy will of course be high, because the test data will contain the same keywords that I used to collect the dataset.
Isn't there another way to circumvent this bias?
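To make the concern concrete, here is a minimal sketch that measures how many test tweets contain one of the collection keywords. The function name `keyword_coverage` and the placeholder tokens `kw1` and `kw2` are hypothetical; real seed keywords and real tweets would replace them, and whitespace splitting is only a rough stand-in for proper Arabic tokenization.

```python
def keyword_coverage(tweets, keywords):
    """Return the fraction of tweets containing at least one seed keyword."""
    if not tweets:
        return 0.0
    keyword_set = set(keywords)
    # A tweet counts as a hit if any of its whitespace-separated tokens
    # is one of the seed keywords.
    hits = sum(1 for tweet in tweets if keyword_set & set(tweet.split()))
    return hits / len(tweets)


# Placeholder data (assumption): stand-ins for real keywords and tweets.
seed_keywords = ["kw1", "kw2"]
test_tweets = ["a kw1 b", "c d", "kw2 e", "f g h"]

coverage = keyword_coverage(test_tweets, seed_keywords)
print(coverage)
```

If every tweet was collected through the keywords, this coverage is close to 1.0 by construction, which is why the measured accuracy would say little about tweets that lack those keywords.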
Note that this is a platform for getting advice on specific code, not for discussing methodologies.
That said, you could manually collect data from this particular dialect, collect other tweets as well, and then build a classifier that predicts which group a tweet belongs to.
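As a rough illustration of such a classifier, here is a small sketch that uses only the Python standard library. The choice of a word-level multinomial naive Bayes model is an assumption made for the example, not something the suggestion above prescribes, and the class name, the labels `target` and `other`, and the placeholder tokens are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Word-level multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of tweets
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.split()  # naive whitespace tokenization
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = text.split()
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Placeholder data (assumption): manually labeled tweets would go here.
train_texts = ["t1 t2 shared", "t1 t3 shared", "o1 o2 shared", "o3 o1 shared"]
train_labels = ["target", "target", "other", "other"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(clf.predict("t1 shared new"))
```

Because the labeled tweets are gathered by hand rather than through a keyword search, the model has to rely on whatever vocabulary actually distinguishes the two groups. In practice a library such as scikit-learn would be a more typical choice than a hand-written model.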