How do I get an "HTTPS" link, and how do I convert EPUB to TXT in Python?

I want to convert an EPUB file to plain text. First I extract the XHTML files from the EPUB with zipfile, and then I try to convert the XHTML to text with BeautifulSoup.

However, there is a problem because the file is local. For example, my XHTML file is at "C:\Users\abc.xhtml", which is a file path and not an "HTTPS" URL, so BeautifulSoup isn't working.

How can I solve this?

'''
import zipfile

zf = zipfile.ZipFile('C:\\Users\\abc.epub')
zf.extractall('C:\\Users\\Desktop\\folder')
'''
import re, requests
from bs4 import BeautifulSoup
html = "C:\\Users\\abc.xhtml"

soup = BeautifulSoup(html, 'lxml')
print(soup.text)

Solution

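You don't need BeautifulSoup for the extraction, and you don't need an "HTTPS" link either. BeautifulSoup parses whatever string you hand it as markup, so passing it a file path just parses the path text itself rather than the file's contents.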
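If you would still like to keep BeautifulSoup, a minimal sketch is to open the extracted file yourself and pass the open file object instead of the path. The helper name xhtml_to_text is made up for illustration, the file is assumed to be UTF-8, and the built-in "html.parser" is used here instead of 'lxml' only so the sketch doesn't depend on lxml being installed:

```python
from bs4 import BeautifulSoup


def xhtml_to_text(path):
    """Return the visible text of a local XHTML file (hypothetical helper)."""
    # Assumption: the file is UTF-8 encoded, which is typical for EPUB content.
    with open(path, encoding="utf-8") as fh:
        # Pass the open file object, not the path string, so the
        # file's markup is what gets parsed.
        soup = BeautifulSoup(fh, "html.parser")
    return soup.get_text()
```

You would call it as xhtml_to_text("C:\\Users\\abc.xhtml") on one of the files extracted from the archive.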

You can convert .epub files to text using the epub-conversion package, installable from PyPI:

pip install epub-conversion

Extracting the text from an EPUB archive, line by line, is then straightforward:

from epub_conversion.utils import open_book, convert_epub_to_lines

book = open_book("some_file.epub")

lines = convert_epub_to_lines(book)

Now, as in your question, you can print the result as a whole or process each line:

print(lines)

# or traverse each line
for line in lines:
    print(line) # Or do something completely different
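If installing a package is not an option, the unzip-then-strip-tags approach from the question can be sketched with the standard library alone. This is only a rough sketch under some assumptions: the names _TextCollector and epub_to_text are made up for illustration, every .xhtml/.html member of the archive is treated as book content, and members are read in archive order rather than the reading order declared in the EPUB's package file, so chapters may come out in a different order:

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping <head>, <script> and <style> content."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def epub_to_text(epub_path):
    """Return the text of every (X)HTML member of the archive, one chunk per line."""
    chunks = []
    with zipfile.ZipFile(epub_path) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                # Assumption: content documents are UTF-8 encoded.
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chunks.extend(parser.parts)
    return "\n".join(chunks)
```

The result is a single string, so writing it to a .txt file is one open(..., "w", encoding="utf-8") call away. No files are extracted to disk, which also sidesteps the local-path confusion from the question.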