python, utf-8, epub

Output format of an .epub conversion with utf-8 is bad


I want to convert an .epub file to .txt and keep only the text. I found a library to do it.

from epub_conversion.utils import open_book, convert_epub_to_lines

f = open("demofile.txt", "a")
book = open_book("razvansividra.epub")
lines = convert_epub_to_lines(book)

for line in lines:
    f.writelines(str(line.encode("utf-8")))

Everything runs, but the main problem is that the output looks like this:

Carte electronic\xc4\x83 publicat\xc4\x83 cu sprijinul Ministerului Afacerilor Externe \xe2\x80\x93 Departamentul Politici pentru Rela\xc8\x9bia cu Rom\xc3\xa2nii de Pretutindeni.'b' 'b'

'b''b''

I assume that sequences like "\xc4" come from special characters in my language, because the book is written in it.


Solution

  • You're taking an unnecessary encoding/decoding round trip.

    Check this little interactive session:

    >>> s = 'electronică'
    >>> b = s.encode('utf-8')
    >>> b
    b'electronic\xc4\x83'
    >>> str(b)
    "b'electronic\\xc4\\x83'"
    
    • First, you have a string s, which you encode. You get a bytes object (note the b'...' notation).
    • Then you call str() on it, which turns it back into a string, not by decoding it but by producing its printable representation, with the b'...' quotes and escape sequences included.
    • When you call f.writelines(), this string is encoded again internally before it is written to disk. Since it is all ASCII by then, that last step isn't obvious, and the escape sequences end up in the file literally.
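
The difference between the two conversions can be checked directly. This is a minimal sketch; the variable names as_repr and as_text are illustrative only:

```python
# Compare str() on a bytes object with a proper decode.
s = "electronică"
b = s.encode("utf-8")        # bytes object holding the UTF-8 encoding of s

as_repr = str(b)             # printable representation of the bytes object, not the text
as_text = b.decode("utf-8")  # the actual inverse of encode()

print(as_repr)
print(as_text)
```

Only decode() recovers the original text; str() keeps the b'...' wrapper and the escape sequences, which is what ended up in the output file.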

    You should open the output file with the right encoding from the beginning. Then you no longer need line.encode('utf-8').

    Thus:

    f = open("demofile.txt", "w", encoding="utf-8")
    

    And then later:

    f.writelines(lines)
    

    Note that there's no need to do for line in lines if you use writelines; it's already meant to be used with an iterable of lines.
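
Putting the pieces together, the corrected write step might look like the sketch below. The lines list is a hypothetical stand-in for the result of convert_epub_to_lines(book), assumed here to contain str objects, and the file goes into a temporary directory so the example runs on its own. Keep in mind that writelines() does not add line terminators, so each item needs its own newline if you want one.

```python
import os
import tempfile

# Hypothetical stand-in for convert_epub_to_lines(book); assumed to yield str objects.
lines = ["Carte electronică publicată\n", "Românii de Pretutindeni\n"]

path = os.path.join(tempfile.mkdtemp(), "demofile.txt")

# The file object encodes str to UTF-8 on write, so no manual encode() is needed.
# The with block also closes the file, which the original snippet never did.
with open(path, "w", encoding="utf-8") as f:
    f.writelines(lines)  # writes each item as-is, without adding newlines

# Reading back with the same encoding returns the original text.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(round_trip)
```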

    When you open the resulting file, make sure your editor or viewer reads it as UTF-8. Some "simple" tools, such as older versions of Windows Notepad, may guess the wrong encoding for a UTF-8 file and display the non-ASCII characters incorrectly.
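
If a viewer does guess the wrong encoding, one possible workaround is to write a byte order mark with Python's utf-8-sig codec, which some Windows programs use as a hint. This is a sketch under that assumption, with a hypothetical file name; whether a given tool honors the mark is not guaranteed.

```python
import os
import tempfile

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demofile_bom.txt")

# "utf-8-sig" writes the 3-byte BOM (EF BB BF) before the UTF-8 text.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("electronică\n")

# Inspect the raw bytes to confirm the BOM is there.
with open(path, "rb") as f:
    raw = f.read()

print(raw[:3])

# Reading with "utf-8-sig" strips the BOM again.
with open(path, encoding="utf-8-sig") as f:
    text = f.read()

print(text)
```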