python, utf-8, encode

’ instead of ' in Natural Reader after encoding with utf-8


I have some text that I got from the web. After processing, it is written to a txt file with

    text_file = open("input.txt", "w")
    text_file.write(finaltext.encode('utf-8'))
    text_file.close()

When I open the txt file, everything looks fine. But when I load it into Natural Reader to turn it into audio, I see ’ instead of ' for some, but not all, of the apostrophes.

What should I do?


Solution

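  • If the file looks fine in a native text editor, the problem is most likely the other program, which is not detecting the encoding correctly and is turning the text into mojibake. As mentioned in the comments, the affected character is almost certainly a Unicode quotation mark that looks like ' but isn't, such as U+2019 (RIGHT SINGLE QUOTATION MARK). Its UTF-8 encoding is three bytes, and a program that reads those bytes as Windows-1252 displays ’. A plain ASCII ' is a single byte that is identical in both encodings, which is why only some of the apostrophes are affected.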

    my_string = ('The Knights who say '
        '\N{LEFT SINGLE QUOTATION MARK}'
        'Ni!'
        '\N{RIGHT SINGLE QUOTATION MARK}'
    )
    def print_repr_escaped(x):
        print(repr(x.encode('unicode_escape').decode('ascii')))
    
    print_repr_escaped(my_string)
    # 'The Knights who say \\u2018Ni!\\u2019'
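
As a rough illustration of where ’ comes from, the sketch below encodes U+2019 as UTF-8 and then decodes the same bytes as Windows-1252. That the other program falls back to Windows-1252 is an assumption; nothing in the question confirms which encoding it actually uses.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK takes three bytes in UTF-8.
curly = '\N{RIGHT SINGLE QUOTATION MARK}'
utf8_bytes = curly.encode('utf-8')
print(utf8_bytes)  # b'\xe2\x80\x99'

# Assumption: the reading program decodes those bytes as Windows-1252 (cp1252).
misread = utf8_bytes.decode('cp1252')
print(ascii(misread))  # '\xe2\u20ac\u2122', which is displayed as ’

# A plain ASCII apostrophe is one byte and survives the same round trip.
straight = "'".encode('utf-8').decode('cp1252')
print(straight)  # '
```

If the three characters produced here match what Natural Reader shows, the file itself is valid UTF-8 and only the reader's decoding is wrong.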
    

    If you can't control which encoding the other program uses, you have two options:

    1. Drop all non-ASCII characters (note that this removes the quotes entirely):

      stripped = my_string.encode('ascii', 'ignore').decode('ascii')
      print_repr_escaped(stripped)
      # 'The Knights who say Ni!'
      
    2. Transliterate non-ASCII characters to their closest ASCII equivalents with a third-party library such as Unidecode:

      import unidecode
      
      converted = unidecode.unidecode(my_string)
      print_repr_escaped(converted)
      # "The Knights who say 'Ni!'"
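
A narrower variant of the two options above, offered as a sketch and not as part of the original answer: replace only the typographic quotes and leave every other non-ASCII character untouched. The mapping table and the `straighten_quotes` name are hypothetical; extend the table with whatever characters your text actually contains.

```python
# Hypothetical mapping from typographic quotes to their ASCII look-alikes.
QUOTES_TO_ASCII = str.maketrans({
    '\N{LEFT SINGLE QUOTATION MARK}': "'",
    '\N{RIGHT SINGLE QUOTATION MARK}': "'",
    '\N{LEFT DOUBLE QUOTATION MARK}': '"',
    '\N{RIGHT DOUBLE QUOTATION MARK}': '"',
})

def straighten_quotes(text):
    """Replace typographic quotes with ASCII ones; leave everything else alone."""
    return text.translate(QUOTES_TO_ASCII)

sample = (
    'The Knights who say '
    '\N{LEFT SINGLE QUOTATION MARK}Ni!\N{RIGHT SINGLE QUOTATION MARK}'
    ' in a caf\N{LATIN SMALL LETTER E WITH ACUTE}'
)
cleaned = straighten_quotes(sample)

# Escape the result so the remaining non-ASCII character is visible.
print(cleaned.encode('unicode_escape').decode('ascii'))
# The Knights who say 'Ni!' in a caf\xe9
```

Unlike the `'ignore'` approach, this keeps the quotes, and unlike Unidecode it needs no extra dependency. Characters outside the table still reach the other program unchanged, so they can still be garbled if that program misreads the encoding.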