I want to convert an EPUB to plain text. First I extract the XHTML files from the EPUB with zipfile, and then I try to convert the XHTML to text with BeautifulSoup.
However, there is a problem because the file is local. For example, my XHTML file is "C:\Users\abc.xhtml", not an "https" URL, so BeautifulSoup isn't working.
How can I solve this?
'''
import zipfile
zf = zipfile.ZipFile('C:\\Users\\abc.epub')
zf.extractall('C:\\Users\\Desktop\\folder')
'''
import re, requests
from bs4 import BeautifulSoup
html = "C:\\Users\\abc.xhtml"
soup = BeautifulSoup(html, 'lxml')
print(soup.text)
You don't need BeautifulSoup for the extraction.
You can convert .epub files to text using the epub-conversion
package, installable from PyPI:
pip install epub-conversion
Now it's a simple task to extract the text from an epub archive:
from epub_conversion.utils import open_book, convert_epub_to_lines
book = open_book("some_file.epub")
lines = convert_epub_to_lines(book)
Now, as in your question, you can print the result as a whole or process each line:
print(lines)
# or traverse each line
for line in lines:
    print(line)  # Or do something completely different
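If you would rather keep your zipfile and BeautifulSoup approach, note that BeautifulSoup parses whatever string it is given as markup. Passing a path makes it parse the path text itself rather than the file's contents, which is why you get no useful output. A minimal sketch of reading the file first is below. The helper name xhtml_to_text and the UTF-8 encoding are my assumptions, and html.parser is used instead of lxml only to avoid the extra dependency:

```python
from pathlib import Path

from bs4 import BeautifulSoup


def xhtml_to_text(path):
    """Return the visible text of one local XHTML file (hypothetical helper)."""
    # Read the file contents first; BeautifulSoup expects markup, not a path.
    # UTF-8 is an assumption; adjust if your EPUB uses another encoding.
    markup = Path(path).read_text(encoding="utf-8")
    soup = BeautifulSoup(markup, "html.parser")
    # Put each text fragment on its own line and trim surrounding whitespace.
    return soup.get_text(separator="\n", strip=True)
```

You could then call xhtml_to_text on each .xhtml file that extractall wrote into your folder and write the results to a .txt file.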