I want to convert an .epub file to .txt and keep only the text. I found a library to do it.
from epub_conversion.utils import open_book, convert_epub_to_lines
f = open("demofile.txt", "a")
book = open_book("razvansividra.epub")
lines = convert_epub_to_lines(book)
for line in lines:
    f.writelines(str(line.encode("utf-8")))
This runs, but the main problem is that the output comes out in this format:
Carte electronic\xc4\x83 publicat\xc4\x83 cu sprijinul Ministerului Afacerilor Externe \xe2\x80\x93 Departamentul Politici pentru Rela\xc8\x9bia cu Rom\xc3\xa2nii de Pretutindeni.'b' 'b'
'b''b''
I assume sequences like "\xc4" come from the special characters of my language, since the book is written in it.
You're taking an unnecessary encoding/decoding round trip.
Check this little interactive session:
>>> s = 'electronică'
>>> b = s.encode('utf-8')
>>> b
b'electronic\xc4\x83'
>>> str(b)
"b'electronic\\xc4\\x83'"
Here is what happens:
1. You start with the string s, which you encode, so you get a bytes object (note the b'...' notation).
2. You call str() on it, which converts it back to a string, not by decoding but by adding extra quotes and escape sequences.
3. When you pass the result to f.writelines(), this string is encoded again internally for writing it to disk. But since it's all ASCII by then, that last step isn't obvious.
You should make sure to open the file with the right encoding from the beginning. Then you won't have to use line.encode('utf-8') anymore.
Thus:
f = open("demofile.txt", "w", encoding="utf-8")
And then later:
f.writelines(lines)
Note that there's no need for the "for line in lines" loop if you use writelines; it's already meant to be used with an iterable of lines.
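Putting the pieces together, a minimal sketch of the corrected write might look like this. The lines list is a stand-in for whatever convert_epub_to_lines(book) returns (sample data, not the real book), and the temporary output path is only there so the snippet runs anywhere. It assumes each line already ends in a newline, since writelines does not add any.

```python
import os
import tempfile

# Stand-in for the result of convert_epub_to_lines(book); sample data only.
lines = [
    "Carte electronică publicată cu sprijinul Ministerului Afacerilor Externe\n",
    "Relația cu Românii de Pretutindeni.\n",
]

# Hypothetical output location; the question uses "demofile.txt".
out_path = os.path.join(tempfile.mkdtemp(), "demofile.txt")

# The file object does the encoding; the lines stay ordinary str objects.
with open(out_path, "w", encoding="utf-8") as f:
    f.writelines(lines)

# Inspect the raw bytes: the special characters are stored as UTF-8 byte
# sequences, with no literal backslash escapes or b'...' wrappers.
with open(out_path, "rb") as f:
    raw = f.read()
print(raw)
```

Using a with block also closes the file for you, which the original snippet never did.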
When you open the resulting file, make sure you use an editor that supports UTF-8. Notably, "simple" Windows tools like Notepad can fail to display UTF-8 files correctly.
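If you'd rather not depend on an editor at all, you can check the file from Python by reading it back with the same encoding. This is only a sketch: the path and the sample text are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical file, written here only so the snippet is self-contained.
path = os.path.join(tempfile.mkdtemp(), "check.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("electronică\n")

# Read it back, stating the encoding explicitly instead of relying on
# the platform default.
with open(path, encoding="utf-8") as f:
    text = f.read()

# ascii() escapes non-ASCII characters, so the result prints safely on any
# console and shows that "ă" came back as one character, not as bytes.
print(ascii(text))
```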