python beautifulsoup epub

How to get an "HTTPS" link, and how to convert EPUB to TXT in Python?


I want to convert an EPUB to TXT. First I extract the EPUB to XHTML files with zipfile, and then I try to convert the XHTML to text with BeautifulSoup.

However, there is a problem because the file is local. For example, my XHTML file is at "C:\Users\abc.xhtml", not at an "HTTPS" URL, so BeautifulSoup isn't working.

How can I solve this?

'''
import zipfile

zf = zipfile.ZipFile('C:\\Users\\abc.epub')
zf.extractall('C:\\Users\\Desktop\\folder')
'''
import re, requests
from bs4 import BeautifulSoup
html = "C:\\Users\\abc.xhtml"

soup = BeautifulSoup(html, 'lxml')
print(soup.text)

Solution

  • You don't need BeautifulSoup for the extraction.

    You can convert .epub files to text using the epub-conversion package, installable from PyPI:

    pip install epub-conversion
    

    Now it's a simple task to extract the text from an epub archive:

    Line-by-line:

    from epub_conversion.utils import open_book, convert_epub_to_lines
    
    book = open_book("some_file.epub")
    
    lines = convert_epub_to_lines(book)
    
    

    Now, as in your question, you can print the result as a whole or process each line:

    print(lines)
    
    # or traverse each line
    for line in lines:
        print(line) # Or do something completely different
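
As for why the original attempt printed nothing useful: BeautifulSoup treats a string argument as markup, so passing "C:\\Users\\abc.xhtml" parses the path text itself rather than the file. No HTTPS link is needed; the file's contents (or an open file object) have to be passed instead. If you would rather avoid extra packages altogether, here is a minimal standard-library sketch of the same idea. The names TextExtractor, xhtml_to_text and epub_to_text are made up for this example, and it assumes the chapters are UTF-8 .xhtml/.html members of the archive. It reads them in archive order, which may differ from the reading order declared in the EPUB's package file.

```python
import zipfile
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def xhtml_to_text(markup):
    """Return the visible text of an (X)HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.parts)


def epub_to_text(epub_path):
    """Join the text of every .xhtml/.html member, in archive order.

    Assumption: members are UTF-8; undecodable bytes are replaced.
    """
    chunks = []
    with zipfile.ZipFile(epub_path) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                markup = zf.read(name).decode("utf-8", errors="replace")
                chunks.append(xhtml_to_text(markup))
    return "\n\n".join(chunks)
```

Calling epub_to_text on the path of your .epub file returns a single string that you can print or write to a .txt file. Reading straight from the archive also makes the extractall step unnecessary.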