Search code examples
epub

Extracting ePub Excerpt


I've read about the ePub format, standard, structure, readers, tools and available developer techniques to manipulate/convert/create ePubs but there is no such thing as a magical function (so far) to extract a particular length of characters to create an excerpt of the book. And that's precisely what I'm looking for: A way to extract the first X words of an ePub.

  • The first approach I'm considering (not my favorite btw) is creating a parser to read all the ePub metadata and start parsing the xml files in the right order until I have enough words to create the excerpt of a determined ePub (I will appreciate some feedback in this direction)

  • The second way (which I can't find so far) is an existent tool/function or parser (in any language) which returns (hopefully) the plain text of the ePub so I can collect the first X words in order to create my excerpt.

Do you know about any tool which can help me achieve the second option?


Solution

  • Jose, I'm not aware of any tool to do what you want. Let me comment on your first approach, though. If you do find a tool I hope these comments allow you to evaluate it.

    I think your approach is fine and, if you want to do a good job of creating an extract, you may want to own this step anyway. I would suggest you,

    • grab the OPF file and look for a GUIDE section. If a GUIDE section exists, check the types that are given. Some are probably not relevant for an excerpt (cover,title-page,copyright-page). Many books will not have the types explicitly stated but this should help where they do.
    • now go through the files in sequence in the SPINE section, excluding anything that is irrelevant, and read through enough XHTML files to get your excerpt.
    • while in the OPF file grab a bunch of metadata if this is relevant for the excerpt (title, creator, date are mandatory, I think, and some authors will also put in a whole bunch of other metadata such as keywords).

    If you are creating a mini-EPUB with this excerpt you will need to pick up any CSS, Audio, Video, Image and Custom Font files that get referenced in the XHTML files used to make your excerpt. You may even choose to use the original cover file for the cover file of your excerpt epub.

    If you working with fixed layout books with fun stuff like Read Aloud AND you want to create a mini-EPUB as an excerpt, you may be better off going with a page count rather than a word count. Don't forget to include any SMIL files into your excerpt and to make it look nice: (i) don't split a two page spread and (ii) make sure that the first page is an odd numbered page if odd in the original or even if even numbered in the original - to do this you may need to add a blank filler page (get the odd/even wrong and subsequent two page spreads won't be facing each other)

    I hope that helps.