Extract text from epub in Python


I have written the following code to extract the words of an ebook and add them to a corpus for text-mining purposes.

# loading the german corpus
from ebooklib import epub
import ebooklib
import os
import nltk
input_path = r"C:\Users\jzeh\Desktop\Directory"
german_corpus = []
book = epub.read_epub(os.path.join(input_path,'grimms-maerchen.epub'))
for doc in book.get_items():
    german_corpus += str(doc.content)
    german_corpus = [w.lower() for w in nltk.word_tokenize(german_corpus)]

Unfortunately, running the code gives me the following error:

---> 12     german_corpus = [w.lower() for w in nltk.word_tokenize(german_corpus)]
TypeError: expected string or bytes-like object

Could anyone tell me what I am missing?


Solution

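  • nltk.word_tokenize takes a string as input, but you have passed it a list. german_corpus starts out as a list, and += with a string only extends that list one character at a time, so it never becomes a string. If I understand correctly, I think you want this: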

    ...
    
    for doc in book.get_items():
        doc_content = str(doc.content)
        for w in nltk.word_tokenize(doc_content):
            german_corpus.append(w.lower())
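
One further caveat, offered as a sketch rather than a definitive fix: in Python 3, str() applied to a bytes object returns its printed form (b'...' with escape sequences) instead of decoded text, and an epub chapter is normally HTML, so the tokens may still contain markup. The example below assumes the chapter content arrives as UTF-8 encoded bytes. The names html_bytes_to_text, simple_tokenize and add_to_corpus are hypothetical helpers, and the regex tokenizer is only a stand-in so that the snippet runs without NLTK or an epub file.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_bytes_to_text(raw, encoding="utf-8"):
    """Decode raw chapter bytes (assumed UTF-8) and return only the visible text."""
    parser = _TextExtractor()
    parser.feed(raw.decode(encoding))
    parser.close()
    return " ".join(parser.chunks)


def simple_tokenize(text):
    """Stand-in tokenizer (an assumption): runs of word characters."""
    return re.findall(r"\w+", text)


def add_to_corpus(corpus, raw, tokenize=simple_tokenize):
    """Tokenize one chapter and append its lowercased tokens to the corpus list."""
    corpus.extend(w.lower() for w in tokenize(html_bytes_to_text(raw)))
    return corpus


# Made-up sample chapter, encoded the way the bytes are assumed to arrive.
sample = "<html><body><p>Es war einmal ein König.</p></body></html>".encode("utf-8")
german_corpus = add_to_corpus([], sample)
```

With ebooklib, the bytes would presumably come from doc.get_content() for each item of book.get_items_of_type(ebooklib.ITEM_DOCUMENT), which also skips images and stylesheets. Passing tokenize=nltk.word_tokenize puts the real tokenizer back in place of the stand-in.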