I am working on a scraping script in Python. I don't want to scrape non-English letters or special characters.
I am using this code to get rid of most symbols/characters/flags that I don't need:
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002500-\U00002BEF" # box drawing, shapes, misc symbols, arrows (not Chinese characters)
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
u"\U0001f926-\U0001f937"
u"\U00010000-\U0010ffff"
u"\u2640-\u2642"
u"\u2600-\u2B55"
u"\u200d"
u"\u23cf"
u"\u23e9"
u"\u231a"
u"\ufe0f" # variation selector-16 (emoji presentation)
u"\u3030"
"]+", re.UNICODE)
Unfortunately, this code still does not remove text like this:
vɒs səˈvɑːnt
meɪhər ʃælæl ˈhæʃ bɑːz
מַהֵר שָׁלָל חָשׁ בַּז
Mahēr šālāl ḥāš baz
How can I get rid of these as well?
Does it filter enough?
import re
string = '''English text? vɒs səˈvɑːnt
\U0001F600 \U0001F64F
meɪhər ʃælæl ˈhæʃ bɑːz
מַהֵר שָׁלָל חָשׁ בַּז
Mahēr šālāl ḥāš baz'''
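# Whitelist approach: delete everything that is not whitespace, an ASCII letter, a digit, or . ! ? -
print(re.sub(r'[^\sA-Za-z0-9.!?\-]+', '', string))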
Output:
English text? vs svnt
mehr ll h bz
Mahr ll baz
I was not sure whether you need punctuation. If not, use this pattern instead: [^\sA-Za-z0-9]
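If you want to reuse this in a scraper, something like the following might work as a starting point. It is only a sketch: the function name `keep_english` and the `keep_punctuation` flag are made up for illustration, and the extra whitespace cleanup is my own assumption about what you want, since removing whole words leaves stray spaces and empty lines behind.

```python
import re

# Hypothetical helper; the two patterns are the ones discussed above.
WITH_PUNCT = re.compile(r'[^\sA-Za-z0-9.!?\-]+')
WITHOUT_PUNCT = re.compile(r'[^\sA-Za-z0-9]+')

def keep_english(text, keep_punctuation=True):
    """Remove every character outside the whitelist, then tidy whitespace."""
    pattern = WITH_PUNCT if keep_punctuation else WITHOUT_PUNCT
    cleaned = pattern.sub('', text)
    # Collapse runs of spaces/tabs left behind by deleted characters.
    cleaned = re.sub(r'[ \t]+', ' ', cleaned)
    # Strip each line and drop lines that ended up empty.
    lines = [line.strip() for line in cleaned.splitlines()]
    return '\n'.join(line for line in lines if line)

# \u00e9 is a precomposed e-acute, \u05de\u05d4 are two Hebrew letters.
print(keep_english('caf\u00e9 ok!\n\u05de\u05d4\nwell-known'))
```

Keep in mind that a whitelist like this deletes precomposed accented letters entirely (so "café" becomes "caf") instead of reducing them to their base letter. Whether that is acceptable depends on your data.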