
How can I extract text content from only the root element? (Java, com.gargoylesoftware.htmlunit.html)


I can't find a way to extract the text content of only the root element using com.gargoylesoftware.htmlunit.html. Here is an example:

<td>
  W 03:10 PM-04:25 PM
  <strong>
     <br>
     Hybrid (50%+ in-person)
  </strong>
</td>

I want to extract the text content of the root element ("td" in this case), but the call below also extracts the text content of the child element, which I don't want:

private void extractTextContent(HtmlElement htmlElement) {
    String content = htmlElement.getTextContent();
    System.out.println(content);
}

output:

W 03:10 PM-04:25 PMHybrid (50%+ in-person)

desired output:

W 03:10 PM-04:25 PM

I've also tried the "asText()" method, but that doesn't give me the desired output either. I couldn't find anyone with the same question about com.gargoylesoftware.htmlunit.html. Is there a method that extracts text content from only the root element?

EDIT: Thank you for the answer. I used the same idea of deleting the child node to get my desired output. Here is the Java syntax:

private void extractTextContent(HtmlElement htmlElement) {
    // Note: this modifies the DOM and only removes the last child element.
    DomNode child = htmlElement.getLastElementChild();
    if (child != null) {
        // Detach the child element so its text is no longer included.
        child.remove();
    }
    // Trim the whitespace left around the root element's own text.
    String content = htmlElement.getTextContent().trim();
    System.out.println(content);
}

Solution

  • You can try removing child nodes before fetching textContent.

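    private void extractTextContent(HtmlElement htmlElement) {
        // Note: this modifies the DOM and only removes the last child element.
        DomNode child = htmlElement.getLastElementChild();
        if (child != null) {
            // Detach the child element so its text is no longer included.
            child.remove();
        }
        // Trim the whitespace left around the root element's own text.
        String content = htmlElement.getTextContent().trim();
        System.out.println(content);
    }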
    

    I have edited my answer to use the Java syntax provided by @XYZ.
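
If modifying the page is undesirable, another option is to walk the element's direct children and keep only the text nodes. The following is a minimal sketch, not a definitive implementation. It assumes that HtmlUnit's DomNode implements the standard org.w3c.dom.Node interface, so an HtmlElement could be passed to the helper directly. The names OwnTextExample, ownText and parseRoot are hypothetical, and the demo uses the JDK's XML parser on a well-formed copy of the snippet (with `<br/>`) instead of HtmlUnit itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class OwnTextExample {

    // Concatenates only the text nodes that are direct children of root,
    // skipping any text nested inside child elements such as <strong>.
    // The DOM is left unchanged.
    static String ownText(Node root) {
        StringBuilder sb = new StringBuilder();
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Hypothetical demo helper: parses well-formed markup with the JDK parser
    // and returns the document's root element.
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String markup =
                "<td>\n"
              + "  W 03:10 PM-04:25 PM\n"
              + "  <strong>\n"
              + "     <br/>\n"
              + "     Hybrid (50%+ in-person)\n"
              + "  </strong>\n"
              + "</td>";
        // Prints: W 03:10 PM-04:25 PM
        System.out.println(ownText(parseRoot(markup)));
    }
}
```

With HtmlUnit, the call would presumably look like `ownText(htmlElement)`. Unlike the removal approach, this handles any number of child elements and does not change the page.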