Python exclude special characters and non-English alphabet


I am working on a scraping script in Python. I don't want to keep non-English letters or special characters in the scraped text.

I am using this code to get rid of most symbols/characters/flags that I don't need:

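    import re

    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad: also covers CJK and the rest of the BMP
        u"\U0001f926-\U0001f937"  # gesture/people emoji
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"  # gender signs
        u"\u2600-\u2B55"  # misc symbols through heavy large circle
        u"\u200d"  # zero-width joiner
        u"\u23cf"  # eject symbol
        u"\u23e9"  # fast-forward symbol
        u"\u231a"  # watch
        u"\ufe0f"  # variation selector-16
        u"\u3030"  # wavy dash
        "]+", re.UNICODE)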

Unfortunately, this pattern still leaves text like this untouched:

vɒs səˈvɑːnt
meɪhər ʃælæl ˈhæʃ bɑːz
מַהֵר שָׁלָל חָשׁ בַּז
Mahēr šālāl ḥāš baz

How can I get rid of these as well?
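
For context, here is a minimal sketch (not part of the original script) that suggests why these lines survive: the lowest code point the character class above covers is U+200D, and IPA, Hebrew, and accented Latin letters all sit below it. The names `LOWEST_COVERED` and `samples` are illustrative assumptions.

```python
# Lowest code point in the question's character class (the zero-width joiner).
LOWEST_COVERED = 0x200D

# One representative character from each kind of leftover text.
samples = {
    "IPA letter": "\u0252",                # ɒ
    "Hebrew letter": "\u05DE",             # מ
    "Latin letter with macron": "\u0113",  # ē
}

for label, ch in samples.items():
    below = ord(ch) < LOWEST_COVERED
    print(f"{label}: U+{ord(ch):04X}, below covered ranges: {below}")
```

Extending the list with more ranges would work, but every new script needs its own entry.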


Solution

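  • One option is to whitelist the characters you want to keep, rather than listing everything you want to remove. Does this filter enough?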

    import re
    
    
    string = '''English text? vɒs səˈvɑːnt
    
    \U0001F600 \U0001F64F
    meɪhər ʃælæl ˈhæʃ bɑːz
    מַהֵר שָׁלָל חָשׁ בַּז
    Mahēr šālāl ḥāš baz'''
    
    
    # Remove everything except whitespace, ASCII letters, digits and . ! ? -
    print(re.sub(r'[^\sA-Za-z0-9.!?\-]+', '', string))
    

    Output:

    English text? vs svnt
    
     
    mehr ll h bz
       
    Mahr ll  baz
    

    I was not sure whether you need punctuation. If not, use the pattern [^\sA-Za-z0-9]+ instead.
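
The two patterns above could be wrapped in one small helper. This is only a sketch: the function name `keep_english` and the `keep_punctuation` flag are hypothetical, chosen for illustration, and the whitelist is the same ASCII set used in the answer.

```python
import re

# Hypothetical helper; the name and the keep_punctuation flag are illustrative.
def keep_english(text, keep_punctuation=True):
    """Drop every character outside a small ASCII whitelist."""
    if keep_punctuation:
        # Keep whitespace, ASCII letters, digits and . ! ? -
        pattern = r"[^\sA-Za-z0-9.!?\-]+"
    else:
        # Keep whitespace, ASCII letters and digits only
        pattern = r"[^\sA-Za-z0-9]+"
    return re.sub(pattern, "", text)


print(keep_english("Mah\u0113r baz!"))
print(keep_english("ok? yes-no", keep_punctuation=False))
```

Note that this deletes accented letters outright instead of reducing them to their base letter, so "Mahēr" becomes "Mahr", as in the output above.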