java html jsoup

JSoup get text and inline images in order


I've got some HTML that looks like this:

<tr>
  <td>
    Some text that is interrupted by an image here:
    <a href="/item" title="item"><img alt="imageName.png" src="linkhere" width="18" height="18"></a>
    and then continues here.
  </td>
</tr>

I need a way to loop through these nodes with JSoup and append either the text or the image's alt attribute to a string, keeping the nodes in their original order.

In the end it should look like this:

Some text that is interrupted by an image here: "imageName.png" and then continues here

So far I'm able to get the image by itself or the text by itself by using:

element.text();
// or
element.select("img").attr("alt");

but I'm having trouble getting both of them into an ordered list.

Any ideas?


Solution

  • The following code should give you the output string you are looking for. It loops through the direct child nodes of the body and checks whether each one is a text node or an element. A text node is appended to the output string unless it is blank. For an element, the code looks for an img inside it and appends that image's alt text. Note that this assumes the text and the link end up as direct children of body; text nested deeper inside another element is not picked up by this loop.

    import java.util.List;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;

    // Assumes doc is an already parsed org.jsoup.nodes.Document.
    StringBuilder test = new StringBuilder();

    Element body = doc.body();
    List<Node> childNodes = body.childNodes();

    for (Node node : childNodes) {
        if (node instanceof TextNode) {
            // Text node: append it unless it is only whitespace.
            // text() returns the decoded text; toString() would return escaped HTML.
            String nodeString = ((TextNode) node).text();
            if (!nodeString.trim().isEmpty()) {
                test.append(nodeString);
            }
        } else if (node instanceof Element) {
            // Element: append the alt text of the first img inside it, if any.
            Element image = ((Element) node).select("img").first();
            if (image != null) {
                test.append(image.attr("alt"));
            }
        }
    }

    String result = test.toString();
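
The loop above only looks at the direct children of the body. If the markup sits inside a real table, or the text is nested at different depths, a depth-first traversal may be easier to work with. The following is a minimal sketch of that idea, not the answer's original method: the class name OrderedTextAndAlt and the helper textWithAlts are hypothetical, and wrapping each alt value in quotes is an assumption based on the output shown in the question. It uses jsoup's Element.traverse with a NodeVisitor, which visits nodes in document order.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class OrderedTextAndAlt {

    // Hypothetical helper: collects text and quoted img alt values
    // in document order, at any nesting depth below root.
    static String textWithAlts(Element root) {
        final StringBuilder out = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                String piece = null;
                if (node instanceof TextNode) {
                    // text() normalizes whitespace; trim drops the padding
                    piece = ((TextNode) node).text().trim();
                } else if (node instanceof Element
                        && "img".equals(((Element) node).tagName())) {
                    // Assumption: quote the alt text as in the question's example
                    piece = "\"" + node.attr("alt") + "\"";
                }
                if (piece != null && !piece.isEmpty()) {
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    out.append(piece);
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>"
                + "Some text that is interrupted by an image here: "
                + "<a href=\"/item\" title=\"item\">"
                + "<img alt=\"imageName.png\" src=\"linkhere\" width=\"18\" height=\"18\"></a>"
                + " and then continues here."
                + "</td></tr></table>";
        Document doc = Jsoup.parse(html);
        System.out.println(textWithAlts(doc.body()));
    }
}
```

Because every piece is trimmed and then joined with a single space, the result does not depend on how the source HTML happens to be indented. If the quotes around the alt text are not wanted, drop them from the img branch.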