Search code examples
pythonpython-3.xweb-scrapingpython-repython-unicode

How do I convert a unicode text to a text that python can read so that I could find that specific word in webscraping results?


I am trying to scrape text in instagram and check if I could find some keywords in the bio but the user use a special fonts, so I cannot identify the specific word, how can I remove the fonts or formot of a text such that I can search the word?

import re
test="๐™„๐™ฃ๐™๐™–๐™ก๐™š ๐™ฉ๐™๐™š ๐™›๐™ช๐™ฉ๐™ช๐™ง๐™š ๐™ฉ๐™๐™š๐™ฃ ๐™š๐™ญ๐™๐™–๐™ก๐™š ๐™ฉ๐™๐™š ๐™ฅ๐™–๐™จ๐™ฉ. "


x = re.findall(re.compile('past'), test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

TEXT NOT FOUND

Another example:

import re
test="า“ส€แด‡แด‡สŸแด€ษดแด„แด‡ ษขส€แด€แด˜สœษชแด„ แด…แด‡sษชษขษดแด‡ส€"
test=test.lower()

x = re.findall(re.compile('graphic'), test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

TEXT NOT FOUND


Solution

  • you can use unicodedata.normalize that Return the normal form for the Unicode string. For your examples see the following code snippet:

    import re
    import unicodedata
    
    test="๐™„๐™ฃ๐™๐™–๐™ก๐™š ๐™ฉ๐™๐™š ๐™›๐™ช๐™ฉ๐™ช๐™ง๐™š ๐™ฉ๐™๐™š๐™ฃ ๐™š๐™ญ๐™๐™–๐™ก๐™š ๐™ฉ๐™๐™š ๐™ฅ๐™–๐™จ๐™ฉ. "
     
    formatted_test = unicodedata.normalize('NFKD', test).encode('ascii', 'ignore').decode('utf-8')
    
    x = re.findall(re.compile('past'), formatted_test)
    if x:    
        print("TEXT FOUND")
    else:
        print("TEXT NOT FOUND")
    

    and the output will be:

    TEXT FOUND