I have a function for cleaning text before tokenization:
def clean_before_tok(text):
    # Replace apostrophes with spaces, then drop common French articles
    # and elided forms (surrounded by spaces so whole words are matched).
    text = text.replace("'", " ")
    exclude = [" le ", " la ", " l ", " un ", " une ", " du ", " de ", " les ", " des ", " s ", " d "]
    for e in exclude:
        text = text.replace(e, " ")
    return text
I can test it on a small example:
test = clean_before_tok("dlkj dfg le se d'ac")
print(test)
>>> dlkj dfg se ac
But when I read the text from a file with
generated_text = open("text-like.txt", 'rb').read().decode(encoding='utf-8')
the apostrophes are not found and replaced. Is there an encoding flaw?
To check the encoding of the file, you can print it as bytes:
>>> with open("my-file.txt", "rb") as file:
... b_file = file.read()
>>> print(b_file)
If the apostrophes show up as plain apostrophes in that dump, something very strange is going on. Usually the issue is explained by the presence of escape sequences like \xAB (where AB can be any two hexadecimal digits; each such sequence stands for a non-ASCII byte) in your text.
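For example, if the dump shows the sequence \xe2\x80\x99, your file contains the typographic apostrophe U+2019 rather than the ASCII one, and text.replace("'", " ") simply never matches it. Here is a minimal sketch of a fix, assuming that is indeed what is in your file (reusing the file name from your question):

# Open in text mode; this decodes the file the same way as your
# open(..., 'rb').read().decode(encoding='utf-8').
with open("text-like.txt", encoding="utf-8") as file:
    generated_text = file.read()

# Normalise typographic apostrophes (and common look-alikes) to the plain
# ASCII apostrophe before running the cleaning function.
for curly in ("\u2019", "\u2018", "\u02bc"):
    generated_text = generated_text.replace(curly, "'")

cleaned = clean_before_tok(generated_text)

The tuple of look-alike characters is just a guess at what might be in your text; adjust it to whatever non-ASCII bytes the dump actually reveals.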