Tags: python, text, encoding, nlp, readfile

Python reading text from file not finding apostrophes


I have a function for cleaning a text:

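def clean_before_tok(text):
    # Turn apostrophes into spaces so elided forms such as "d'ac" become "d ac".
    text = text.replace("'", " ")
    # French articles and elided particles to drop; the surrounding spaces match whole words only.
    exclude = [" le ", " la ", " l ", " un ", " une ", " du ", " de ", " les ", " des ", " s ", " d "]
    for e in exclude:
        text = text.replace(e, " ")
    return text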

It works on a toy example:

test = clean_before_tok("dlkj dfg le se d'ac")
print(test)
# prints: dlkj dfg se ac

But when I read the text from a file with

generated_text=open("text-like.txt", 'rb').read().decode(encoding='utf-8')

the function does not find or replace the apostrophes. Is this an encoding problem?


Solution

  • To check how the file is encoded, print its contents as raw bytes:

    >>> with open("my-file.txt", "rb") as file:
    ...     b_file = file.read()
    >>> print(b_file)
    

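    If the apostrophes show up as plain ' characters, the file is not the problem, which would be surprising. More likely the output contains escape sequences of the form \xAB, where A and B are hexadecimal digits (0-9, a-f); each sequence stands for one non-ASCII byte. A typographic apostrophe ’ (U+2019), for example, is encoded in UTF-8 as the three bytes \xe2\x80\x99, and replace("'", " ") does not match it because it is a different character from the ASCII apostrophe.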
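
Once the byte dump has been inspected, the next step might look like the sketch below. It assumes the non-ASCII bytes turn out to be a typographic apostrophe such as U+2019 (UTF-8 bytes \xe2\x80\x99), which is only a guess about the asker's file. The helper names list_non_ascii and normalize_apostrophes and the APOSTROPHE_LIKE table are hypothetical, introduced here for illustration.

```python
import unicodedata

# Assumed set of apostrophe look-alikes; extend it to match what the file really contains.
APOSTROPHE_LIKE = {
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
}


def list_non_ascii(text):
    """Return the distinct non-ASCII characters in text, mapped to their Unicode names."""
    return {ch: unicodedata.name(ch, "UNKNOWN") for ch in set(text) if ord(ch) > 127}


def normalize_apostrophes(text):
    """Replace apostrophe look-alikes with the ASCII apostrophe (U+0027)."""
    return text.translate(str.maketrans(APOSTROPHE_LIKE))


# Simulate what a file containing a typographic apostrophe looks like as bytes.
sample_bytes = "d\u2019ac".encode("utf-8")
print(sample_bytes)                   # b'd\xe2\x80\x99ac'

sample = sample_bytes.decode("utf-8")
print(list_non_ascii(sample))         # shows the character with the name RIGHT SINGLE QUOTATION MARK
print(normalize_apostrophes(sample))  # d'ac
```

If this matches the file, calling normalize_apostrophes(generated_text) before clean_before_tok(generated_text) should let the existing replace("'", " ") do its job. If list_non_ascii reports a different character, add it to the table rather than relying on the three listed here.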