
How to not get contents of child elements within HtmlUnit?


I have the following:

    <th>
    Q4/10
    <br>
    <span> Nov 30, 2010 </span>
    </th>

and I'd like to get Q4/10 but not the date that follows. I'm not sure how to do it within HtmlUnit. I know I can take the cell's full text, split it on spaces, and keep everything before the first space, but I'm looking for something based on the tags themselves.


Solution

  • If you know that the text you want comes before any child elements, you can just grab the cell's first child, which is a text node containing your text plus some whitespace:

    HtmlTableHeaderCell th = ...
    // The first child is the text node in front of <br>; trim() drops the surrounding newlines.
    System.err.println( th.getFirstChild().getTextContent().trim() ) ;
    

    A more general solution is to loop over the children of th, collecting the text nodes and ignoring child elements.
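A rough sketch of that loop, written against the standard org.w3c.dom interfaces rather than HtmlUnit classes. The assumption here is that HtmlUnit's DomNode implements org.w3c.dom.Node, so an HtmlTableHeaderCell should be accepted by the same method. The names OwnText, ownText and parseRoot are made up for this example, and the demo parses the markup with the JDK's XML parser as a stand-in for a page loaded through HtmlUnit.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class OwnText {

    /** Concatenates the text nodes that are direct children of parent, skipping child elements. */
    static String ownText(Node parent) {
        StringBuilder sb = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Keep only text nodes; <br>, <span> and other elements are ignored.
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    /** Hypothetical demo helper: parses well-formed XML and returns its root element. */
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo markup", e);
        }
    }

    public static void main(String[] args) {
        // Same markup as in the question; <br> is written as <br/> because
        // the JDK parser used for this demo requires well-formed XML.
        Node th = parseRoot("<th>\nQ4/10\n<br/>\n<span> Nov 30, 2010 </span>\n</th>");
        System.out.println(ownText(th)); // prints "Q4/10"
    }
}
```

With HtmlUnit you would skip parseRoot and pass the cell itself, e.g. ownText(th). Unlike the getFirstChild() version, this also picks up text that appears after a child element.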