python twitter dataset text-classification

What is the most efficient way to extract tweets written in a certain dialect?

I'm doing text classification for Arabic dialects, and I need to collect data, so I'm using the Twitter API to do that.

However, the problem is:

I need to find tweets that are written in the same dialect.

One solution I have is:

Collect tweets based on certain keywords that only one dialect has.

One problem with that solution is:

When I test the model, the accuracy will of course be high, because the test data will contain the same keywords that I used to collect the dataset.

What I'm looking for

Is there another way to collect the data that avoids this bias?

Solution

Note that this is a platform for getting advice on specific code, not for discussing methodologies.

That said, you could manually collect tweets written in this particular dialect, collect other tweets as well, and then build a classifier that predicts which group a tweet belongs to.
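To make that suggestion a bit more concrete, here is a minimal sketch of such a classifier, using only the standard library. It is a character n-gram Naive Bayes model with add-one smoothing, which is one possible choice and not the only one. The class name `NaiveBayesDialectClassifier`, the helper `char_ngrams`, and the labels `"dialect"` and `"other"` are all hypothetical names made up for illustration, and the training strings are placeholders, not real tweets.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the character n-grams of a whitespace-normalised text."""
    text = " ".join(text.split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesDialectClassifier:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()             # label -> number of tweets
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.label_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.ngram_counts[label]
            # Add-one smoothing so unseen n-grams do not zero out a label.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Placeholder strings stand in for manually labelled tweets.
train_texts = ["aaa aaa aaa", "bbb bbb bbb"]
train_labels = ["dialect", "other"]
clf = NaiveBayesDialectClassifier(n=3).fit(train_texts, train_labels)
print(clf.predict("aaa"))
```

In practice, `train_texts` would be the manually collected tweets and `train_labels` the group each one belongs to. Because the tweets are not selected by keyword, the test accuracy is not inflated by the collection keywords.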
