
How to extract specific text from HTML table?


Here is my HTML file. I want to extract three pieces of text from it: "Pending", "Next Listing Date (Likely):" and "10/01/2014". I am using Jaunt and Jsoup.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
   <head>
      <meta http-equiv="Content-Language" content="en-us"/>
      <meta http-equiv="Content-Type" content="text/html;url=http://allahabadhighcourt.in/casestatus/utf-8"/>
      <title>Case Status Result</title>
      <link REL="StyleSheet" href="http://allahabadhighcourt.in/alldhc.css" TYPE="text/css"/>
      <script src="http://allahabadhighcourt.in/alldhc.js" LANGUAGE="JavaScript" TYPE="text/javascript">
      <!--
      -->
      </script>
   </head>
   <body onLoad="bodyOnLoad()">
      <div CLASS="heading">
         <img BORDER="0" src="http://allahabadhighcourt.in/image/titleEN.gif" WIDTH="532" HEIGHT="30" ALT="HIGH COURT OF JUDICATURE AT ALLAHABAD"/>
      </div>
      <h4 CLASS="subheading" ALIGN="center" STYLE="margin-top: 6pt; margin-bottom: 0pt">Case Status - Allahabad</h4>
      <p ALIGN="center" STYLE="margin-top: 0; margin-bottom: 6pt">
         <img BORDER="0" src="http://allahabadhighcourt.in/image/blueline.gif" WIDTH="210" HEIGHT="1"/></p>
<table ALIGN="center" CLASS="withb" WIDTH="60%" COLS="2">
<tr><td VALIGN='top' COLSPAN='2' ALIGN='right' STYLE='font-size: 18pt'>Pending</td></tr><tr><td VALIGN='top' ALIGN='center' COLSPAN='2' STYLE='font-size: 16pt'>Criminal Misc. Bail Application : 12898 of 2013 [Etah]</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Petitioner:</td><td STYLE='font-size: 14pt'>AVANISH</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Respondent:</td><td STYLE='font-size: 14pt'>STATE OF U.P.</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Counsel (Pet.):</td><td STYLE='font-size: 14pt'>SANJEEV MISHRA</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Counsel (Res.):</td><td STYLE='font-size: 14pt'>GOVT. ADVOCATE</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Category:</td><td VALIGN='top'>Criminal Jurisdiction Application-U/s 439, Cr.p.c., For Bail (major)</td></tr><tr><td VALIGN='top' WIDTH='35%' STYLE='font-size: 14pt'>Date of Filing:</td><td VALIGN='top' STYLE='font-size: 14pt'>08/05/2013</td></tr><tr><td WIDTH='35%' STYLE='font-size: 14pt'>Last Listed on:</td><td STYLE='font-size: 14pt'>03/01/2014 in Court No. 48</td></tr><tr><td WIDTH='35%' STYLE='font-size: 14pt'>Next Listing Date (Likely):</td><td STYLE='font-size: 14pt'>10/01/2014</td></tr><tr><td COLSPAN='2'></td></tr></table><p STYLE="text-align: justify; margin-top: 16pt; margin-left: 90pt; margin-right: 90pt; font-size: 10pt">This is not an authentic/certified copy of the information regarding status of a case. Authentic/certified information may be obtained under Chapter VIII Rule 30 of Allahabad High Court Rules. Mistake, if any, may be brought to the notice of OSD (Computer).</p>
      <table ALIGN="center" WIDTH="80%" COLS="1" RULES="NONE" BORDER="0" STYLE="margin-top: 16pt">
         <tbody>
            <tr ALIGN="center" VALIGN="TOP">
               <td VALIGN="TOP" ALIGN="center">
                  <img ALT="Back" src="http://allahabadhighcourt.in/image/back.gif" WIDTH="30" HEIGHT="25" BORDER="0" onClick="location.href='indexA.html'" STYLE="cursor:pointer"/>
               </td>
            </tr>
         </tbody>
      </table>
   </body>
</html>

Solution

  • As already pointed out in some comments, it is hard to target specific elements here because the cells have no distinguishing tag attributes. However, if your table always keeps the same structure (perhaps with blank values sometimes), you can use Jsoup's CSS selectors to pick elements by their index.

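    // Assumes `html` is a String holding the page source; load it however suits you.
    Document doc = Jsoup.parse(html);
    
    // td:eq(0) matches every <td> that is the first cell of its row.
    Element pending = doc.select("table td:eq(0)").first();   // first row's first cell
    Element nextDate = doc.select("table td:eq(0)").get(9);   // tenth row's first cell (the label)
    // td:eq(1) matches every <td> that is the second cell of its row.
    Element date = doc.select("table td:eq(1)").last();       // last value cell
    
    System.out.println(pending.text() + "\n" + nextDate.text() + "\n" + date.text());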
    

    This prints:

    Pending
    Next Listing Date (Likely):
    10/01/2014
    

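    Note the use of the pseudo-selector :eq(n) to pick elements by index. In Jsoup, td:eq(0) matches every td whose sibling index is 0, that is, the first cell of each row, so first() returns the "Pending" cell and get(9) returns the first cell of the tenth row. Likewise, td:eq(1) matches the second cell of each row, and last() returns the final one.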
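    To try the index-based approach in isolation, here is a minimal, self-contained sketch. The class name CaseStatusByIndex and the shortened table are assumptions made for illustration (the real page has more rows, so the label index is 9 there rather than 2), and it assumes Jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class CaseStatusByIndex {

    // Shortened stand-in for the table in the question (assumption: same row layout).
    static final String HTML =
          "<table class='withb'>"
        + "<tr><td colspan='2'>Pending</td></tr>"
        + "<tr><td>Petitioner:</td><td>AVANISH</td></tr>"
        + "<tr><td>Next Listing Date (Likely):</td><td>10/01/2014</td></tr>"
        + "<tr><td colspan='2'></td></tr>"
        + "</table>";

    // Returns {status, label, date} using index-based selectors only.
    static String[] extract(String html) {
        Document doc = Jsoup.parse(html);
        Elements firstCells = doc.select("table td:eq(0)");  // first cell of every row
        Elements secondCells = doc.select("table td:eq(1)"); // second cell of every row
        return new String[] {
            firstCells.first().text(), // first row: the status
            firstCells.get(2).text(),  // third row here; index 9 on the real page
            secondCells.last().text()  // last row that has a second cell
        };
    }

    public static void main(String[] args) {
        for (String value : extract(HTML)) {
            System.out.println(value);
        }
    }
}
```

    Because everything depends on row positions, any row added to or removed from the page shifts the indexes, so this only holds while the layout stays fixed.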

    If each of the elements had its own distinct attributes, you could select them with an attribute selector such as [attr=value], which in this case would be something like [VALIGN=top]. That does not work here, because many cells share the same attribute values and the cells you want are not the only ones that match.
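    The following sketch shows why the attribute selector is too broad for this table. The class name AttributeSelectorDemo, the helper textsOf and the shortened table are assumptions for illustration; the attributes are copied from the question's markup, and Jsoup is assumed to be on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class AttributeSelectorDemo {

    // Shortened stand-in for the table in the question, keeping its attributes.
    static final String HTML =
          "<table>"
        + "<tr><td VALIGN='top' COLSPAN='2'>Pending</td></tr>"
        + "<tr><td VALIGN='top' WIDTH='35%'>Petitioner:</td><td>AVANISH</td></tr>"
        + "<tr><td VALIGN='top' WIDTH='35%'>Date of Filing:</td><td VALIGN='top'>08/05/2013</td></tr>"
        + "<tr><td WIDTH='35%'>Next Listing Date (Likely):</td><td>10/01/2014</td></tr>"
        + "</table>";

    // Collects the text of every element matched by the given CSS query.
    static List<String> textsOf(String html, String cssQuery) {
        List<String> texts = new ArrayList<>();
        for (Element e : Jsoup.parse(html).select(cssQuery)) {
            texts.add(e.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Matches four unrelated cells and misses the "Next Listing Date" row entirely.
        System.out.println(textsOf(HTML, "td[valign=top]"));
    }
}
```

    The selector returns several unrelated cells and skips the next-listing-date row, since that row's cells carry no VALIGN attribute at all.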

    I strongly suggest that you read more about how to use the selector syntax to parse an HTML document. The Jsoup documentation on selector syntax is a good place to start.