java html bufferedreader

Reading HTML line by line with a BufferedReader: how do I skip the HEAD tag info of a webpage?


I have a quick question that I am having a hard time figuring out. I want to read an HTML file line by line, but I want to skip over the HEAD section. I figured that I could start storing the text once I am past the closing HEAD tag.

So far I have created:

BufferedReader reader = new BufferedReader(new InputStreamReader(socket.getInputStream()));

StringBuilder string = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
    if (line.startsWith("<html>")) 
        string.append(line + "\n");
}

I want to save the HTML code in memory without the HEAD information.

Example:

<HTML>

<HEAD>

    <TITLE>Your Title Here</TITLE>

</HEAD>

<BODY BGCOLOR="FFFFFF">

    <CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"> </CENTER>

    <a href="http://somegreatsite.com">Link Name</a>is a link to another nifty site

    <H1>This is a Header</H1>

    <H2>This is a Medium Header</H2>

    Send me mail at <a href="mailto:[email protected]">[email protected]</a>.

</BODY>

I want to save everything except the HEAD tag and its contents.


Solution

  • How about something like this? It skips every line before the opening html tag and keeps everything from that line on:

    boolean htmlFound = false;                        // Have we found an open html tag?
    StringBuilder string = new StringBuilder();       // Back to your code...
    String line;
    while ((line = reader.readLine()) != null) {
      if (!htmlFound) {                               // Have we found it yet?
        if (line.trim().toLowerCase().startsWith("<html")) { // Does this line (ignoring indentation) open an html tag?
          htmlFound = true;                           // yes? Excellent!
        } else {
          continue;                                   // Skip over this line...
        }
      }
      System.out.println("This is each line: " + line);
      string.append(line).append('\n');
    }
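
The loop above starts storing at the opening html tag, so the HEAD block itself still ends up in the StringBuilder. A minimal sketch of one way to drop it as well is to track an "inside HEAD" flag while reading. The class name `HeadSkipper` and the helper `readWithoutHead` are hypothetical names chosen for illustration, and the sketch assumes the opening and closing HEAD tags each start on their own line, as in the example page. Anything else on the same line as `</HEAD>` is dropped. For arbitrary real-world HTML, a proper HTML parser would be more robust than line matching.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;

class HeadSkipper {

    // Hypothetical helper: returns all lines except those from <head> through </head>.
    // Assumption: the head tags start their own lines (leading whitespace is allowed).
    static String readWithoutHead(BufferedReader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        boolean inHead = false;                 // Are we currently inside the HEAD block?
        String line;
        while ((line = reader.readLine()) != null) {
            String lower = line.trim().toLowerCase(Locale.ROOT);
            // Match "<head>" or "<head ...>" but not "<header>".
            if (!inHead && (lower.startsWith("<head>") || lower.startsWith("<head "))) {
                inHead = true;
            }
            if (!inHead) {
                out.append(line).append('\n');  // Keep lines outside the HEAD block
            }
            if (inHead && lower.contains("</head>")) {
                inHead = false;                 // HEAD block ends on this line
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the socket stream from the question.
        String html = "<HTML>\n"
                + "<HEAD>\n"
                + "    <TITLE>Your Title Here</TITLE>\n"
                + "</HEAD>\n"
                + "<BODY BGCOLOR=\"FFFFFF\">\n"
                + "    <H1>This is a Header</H1>\n"
                + "</BODY>\n"
                + "</HTML>\n";
        System.out.print(readWithoutHead(new BufferedReader(new StringReader(html))));
    }
}
```

With the socket from the question, the same helper could be called as `readWithoutHead(new BufferedReader(new InputStreamReader(socket.getInputStream())))`. Note that a raw socket stream would also contain the HTTP response headers before the HTML, which this sketch does not strip.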