regex parsing language-agnostic extract epub

Looking to extract the text from ePubs, but remove the table of contents. Is this possible?

For an application I'm creating, I'm looking to extract the text from open source ePubs, and manipulate the text. However, I don't want the table of contents. I'd just like from Chapter 1 or the Prologue/Preface on.

Take Tom Sawyer on Project Gutenberg for example: http://www.gutenberg.org/ebooks/74

An ePub is pretty much just a ZIP file with a bunch of HTML documents. When I unzip the ePub from the link above and open the first HTML file, I get the first chapter along with table-of-contents material that I don't want.

That's where I'm curious. Is it possible, via some metadata that I'm missing or via regex, to detect and remove the table of contents?

To be clear, I'm talking programmatically.

Solution

In ePub 2, the table of contents lives in its own file. Start with container.xml, which always has the same name and the same location (META-INF/container.xml) in an ePub.

$ unzip -p /Users/mwu/Downloads/9781434705211.epub META-INF/container.xml
<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles>
    <rootfile full-path="OPS/package.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>

That tells you that the ePub package metadata is located in OPS/package.opf. The package metadata contains a manifest listing every file in the ePub and a spine listing the order in which they appear in the book. The spine tag's toc attribute also identifies the table of contents. The items listed in the spine are the files that make up the book itself; anything marked linear="no" is auxiliary content rather than primary content. According to the specification, the first item with linear="yes" (the default value) begins the main reading order. However, the main reading order can itself contain a table of contents as part of the book, as is the case in this book.

<manifest>
...
<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
...
</manifest>
<spine toc="ncx">
<itemref idref="my-html-cover" linear="no"/>
<itemref idref="title"/>
<itemref idref="f1"/>
<itemref idref="ded"/>
<itemref idref="contents"/>
<itemref idref="ack"/>
<itemref idref="f2"/>
<itemref idref="chapter1"/>
<itemref idref="chapter2"/>
<itemref idref="chapter3"/>
<itemref idref="chapter4"/>
<itemref idref="chapter5"/>
<itemref idref="chapter6"/>
<itemref idref="chapter7"/>
<itemref idref="b1"/>
<itemref idref="b2"/>
<itemref idref="b3"/>
<itemref idref="b4"/>
<itemref idref="copyright"/>
</spine>

This tells you that the table of contents is identified by the ncx item in the manifest, which references the toc.ncx file. Note that the path is relative to the package.opf file, so it can be found at OPS/toc.ncx.
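The lookup described so far (container.xml, then the package file, then the manifest and spine) can be sketched in Python with only the standard library. This is a minimal sketch, not a full ePub reader: the function name read_spine is made up for illustration, it assumes the package file uses the standard OPF namespace http://www.idpf.org/2007/opf (the excerpt above does not show it), and it does not handle percent-encoded hrefs or DRM-protected files.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
# Assumption: the package file declares the standard OPF namespace.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def read_spine(epub_path):
    """Return (toc_path, linear_paths); paths are relative to the ZIP root.

    read_spine is a hypothetical helper name, not part of any library.
    """
    with zipfile.ZipFile(epub_path) as zf:
        # container.xml always lives at this fixed location.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    # Manifest hrefs are relative to the directory holding the package file.
    opf_dir = posixpath.dirname(opf_path)
    hrefs = {
        item.get("id"): posixpath.join(opf_dir, item.get("href"))
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }

    spine = package.find("opf:spine", OPF_NS)
    # The spine's toc attribute names the manifest id of the NCX file.
    toc_path = hrefs.get(spine.get("toc"))
    # linear defaults to "yes"; skip only items explicitly marked "no".
    linear_paths = [
        hrefs[ref.get("idref")]
        for ref in spine.findall("opf:itemref", OPF_NS)
        if ref.get("linear", "yes") != "no"
    ]
    return toc_path, linear_paths
```

Calling read_spine on a book laid out like the example should give OPS/toc.ncx as the first value and the primary content files, in reading order, as the second.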

The toc.ncx file contains a navMap tag which lists navPoint tags defining the different parts of the book and references to them.
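As a rough illustration, the navPoint entries can be pulled out of the NCX like this. It is a sketch under a couple of assumptions: the helper name read_nav_points is made up, and the file is assumed to use the standard NCX namespace http://www.daisy.org/z3986/2005/ncx/. Nested navPoint tags are flattened into one list in document order.

```python
import xml.etree.ElementTree as ET

# Assumption: the NCX file declares the standard DAISY NCX namespace.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def read_nav_points(ncx_xml):
    """Return (label, src) pairs for every navPoint, in document order.

    read_nav_points is a hypothetical helper name, not a library function.
    """
    root = ET.fromstring(ncx_xml)
    entries = []
    # "//" also matches nested navPoints (sub-chapters).
    for point in root.iterfind("ncx:navMap//ncx:navPoint", NCX_NS):
        label = point.findtext(
            "ncx:navLabel/ncx:text", default="", namespaces=NCX_NS
        )
        content = point.find("ncx:content", NCX_NS)
        src = content.get("src") if content is not None else None
        entries.append((label.strip(), src))
    return entries
```

The src values are relative to the toc.ncx file and may carry a fragment (for example chapter1.html#part2), so strip the fragment before comparing them with spine entries.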

Both the <spine> tag in the package.opf file and the toc.ncx file give you a listing of the parts of the book and the order they go in. Both also list contents.html, which I think is what you want to exclude. Nothing consistently identifies that in-spine table of contents, nor is it guaranteed to exist in a book at all. You can try scanning the spine tag and the contents of each spine item file for words that commonly identify a table of contents, or for a series of links that reference other spine items in the book, but that will not catch every case.
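The link-scanning idea could look something like the sketch below. Everything here is an assumption for illustration: the name looks_like_toc, the min_targets threshold of 3, and the choice to count distinct spine files linked from a page. It is a heuristic only, and it will misfire on books with heavy cross-referencing or with a table of contents embedded in the same file as the first chapter.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import unquote, urldefrag


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def looks_like_toc(html_text, own_path, spine_paths, min_targets=3):
    """Guess whether a spine item is a table of contents.

    Hypothetical heuristic: flag the file if it links to at least
    min_targets *other* spine files. The threshold is arbitrary.
    """
    collector = _LinkCollector()
    collector.feed(html_text)
    base = posixpath.dirname(own_path)
    targets = set()
    for href in collector.hrefs:
        path, _fragment = urldefrag(href)
        if not path:
            continue  # same-file anchor such as "#note1"
        # Resolve the link relative to the file that contains it.
        resolved = posixpath.normpath(posixpath.join(base, unquote(path)))
        if resolved != own_path and resolved in spine_paths:
            targets.add(resolved)
    return len(targets) >= min_targets
```

Here own_path and spine_paths are expected to be paths relative to the ZIP root, such as OPS/contents.html. Whatever the heuristic flags is only a candidate table of contents, so it is worth checking a sample of books by hand before dropping anything.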

Generally, files like that are considered part of the book and removing them is considered incorrect (accessibility is one of the bigger reasons why).

Also, note that the ePub 2 file specifications can be found at http://idpf.org/epub/201. The ePub 3 specifications are at http://idpf.org/epub/30