I am trying to scrape text from Instagram and check whether certain keywords appear in a user's bio, but some users write their bio in special fonts, so I cannot match the specific word. How can I remove the fonts or formatting from the text so that I can search for the word?
import re
test="𝙄𝙣𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙛𝙪𝙩𝙪𝙧𝙚 𝙩𝙝𝙚𝙣 𝙚𝙭𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙥𝙖𝙨𝙩. "
x = re.findall(re.compile('past'), test)
if x:
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")
TEXT NOT FOUND
Another example:
import re
test="ғʀᴇᴇʟᴀɴᴄᴇ ɢʀᴀᴘʜɪᴄ ᴅᴇsɪɢɴᴇʀ"
test=test.lower()
x = re.findall(re.compile('graphic'), test)
if x:
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")
TEXT NOT FOUND
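You can use unicodedata.normalize, which returns the normal form of a Unicode string. The NFKD form applies compatibility decomposition, which maps styled characters such as the mathematical bold-italic letters in your first example back to their plain equivalents; encoding to ASCII with 'ignore' then drops anything that is still not ASCII. For your first example, see the following code snippet: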
import re
import unicodedata
test="𝙄𝙣𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙛𝙪𝙩𝙪𝙧𝙚 𝙩𝙝𝙚𝙣 𝙚𝙭𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙥𝙖𝙨𝙩. "
formatted_test = unicodedata.normalize('NFKD', test).encode('ascii', 'ignore').decode('utf-8')
x = re.findall(re.compile('past'), formatted_test)
if x:
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")
and the output will be:
TEXT FOUND
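Note that the small-capital letters in your second example are separate Unicode letters with no compatibility decomposition, so NFKD leaves them unchanged and the ASCII step simply discards them. One possible workaround, as a rough sketch, is to translate those letters by hand before normalizing. The names SMALL_CAPS, plain_text and contains_keyword below are made up for illustration, and the table only covers the letters that appear in your example, so you would need to extend it for other styled alphabets:

```python
import unicodedata

# Hypothetical hand-written table: small-capital letters have no Unicode
# compatibility decomposition, so NFKD cannot convert them. Only the
# letters from the second example are listed; extend as needed.
SMALL_CAPS = str.maketrans({
    "\u0493": "f",  # Cyrillic ghe with stroke, used as a look-alike for f
    "\u0280": "r",  # small capital R
    "\u1d07": "e",  # small capital E
    "\u029f": "l",  # small capital L
    "\u1d00": "a",  # small capital A
    "\u0274": "n",  # small capital N
    "\u1d04": "c",  # small capital C
    "\u0262": "g",  # small capital G
    "\u1d18": "p",  # small capital P
    "\u029c": "h",  # small capital H
    "\u026a": "i",  # small capital I
    "\u1d05": "d",  # small capital D
})


def plain_text(text):
    """Return a lowercase ASCII approximation of styled text."""
    # 1. Replace the hand-mapped look-alike letters.
    text = text.translate(SMALL_CAPS)
    # 2. Decompose compatibility characters (e.g. mathematical bold italic).
    text = unicodedata.normalize("NFKD", text)
    # 3. Drop whatever is still not ASCII, then lowercase for matching.
    return text.encode("ascii", "ignore").decode("ascii").lower()


def contains_keyword(text, keyword):
    """Case-insensitive substring search on the cleaned-up text."""
    return keyword.lower() in plain_text(text)


bio = "ғʀᴇᴇʟᴀɴᴄᴇ ɢʀᴀᴘʜɪᴄ ᴅᴇsɪɢɴᴇʀ"
if contains_keyword(bio, "graphic"):
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")
```

Since the keyword is a fixed string, a plain `in` check is enough here and no regular expression is needed. Keep in mind that anything not covered by the table or by NFKD is silently dropped, so unusual fonts may still slip through.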