Tags: java, android, xml, parsing, epub

Extract text between two links in HTML through Java


I am trying to retrieve the text data from an ePub file using Java. The text of the ePub file lies within an HTML file that is formatted something like this:

<h2 id="pgepubid00001">Chapter I</h2>

<p>Some text</p>
<p>Another line of Text</p>

<br/>

<h2 id="pgepubid00002">Chapter II</h2>

etc..

Before opening this file I already know the id of the chapter I need to extract, and I can find the id of the next chapter too. Because of this, I thought a logical approach would be to parse the file with a SAX parser and extract the text in each paragraph until I reach the link of the next chapter. But this is proving quite a task.

Of course, everything is dynamic, so there is no fixed link to go to. The HTML is fairly strictly formatted, so I didn't expect parsing to be so much of a problem. Can anyone recommend a good way to extract the text needed?

The solution needs to be JAVA ONLY; no other languages can be used. I am looking to implement this on an Android device.


Solution

  • Well, you know the ids of the chapters, so why not use String.indexOf?

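    // Find the heading of the wanted chapter and the heading of the next one.
    // indexOf returns -1 when the marker is missing, so check before slicing.
    int start = text.indexOf("<h2 id=\"pgepubid00001\">");
    int end = text.indexOf("<h2 id=\"pgepubid00002\">");

    // Java's substring takes (beginIndex, endIndex), not (begin, length).
    String whatYoureLookingFor = text.substring(start, end);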
    

    Keep it simple.
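The indexOf idea above can be wrapped in a small helper that also covers the cases where a heading is missing. This is only a minimal sketch: the class name `ChapterExtractor` and the methods `extractChapterHtml` and `stripTags` are made up for illustration, and it assumes the headings are written exactly as `<h2 id="...">` with no extra attributes or whitespace. The tag stripping is a crude regular expression that does not decode entities such as `&amp;`, so treat it as a starting point rather than a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ChapterExtractor {

    // Matches any single tag such as <p>, </p> or <br/>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /**
     * Returns the raw HTML from the heading with id startId up to, but not
     * including, the heading with id endId. Returns null if the start heading
     * is not found. If the end heading is not found, returns everything up to
     * the end of the text.
     */
    static String extractChapterHtml(String text, String startId, String endId) {
        int start = text.indexOf("<h2 id=\"" + startId + "\">");
        if (start < 0) {
            return null;
        }
        // Search for the next heading only after the start heading.
        int end = text.indexOf("<h2 id=\"" + endId + "\">", start + 1);
        if (end < 0) {
            end = text.length();
        }
        return text.substring(start, end);
    }

    /** Replaces every tag with a space, then collapses runs of whitespace. */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String text = "<h2 id=\"pgepubid00001\">Chapter I</h2>\n"
                + "<p>Some text</p>\n<p>Another line of Text</p>\n<br/>\n"
                + "<h2 id=\"pgepubid00002\">Chapter II</h2>\n<p>More text</p>";
        String chapter = extractChapterHtml(text, "pgepubid00001", "pgepubid00002");
        System.out.println(stripTags(chapter));
    }
}
```

Only `java.lang` and `java.util.regex` are used, so this should also work on Android. If the markup turns out to be less regular than assumed, a real parser is likely the safer choice.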