Parsing complicated XML using Jsoup


I'm trying to parse an XML document with Jsoup, specifically the contents of the paragraph tag in the example shown below.

...
<nitf:body.content>
     <p> Content would be here. </p>
</nitf:body.content>
...

There are multiple paragraph tags in the document, so I chose to use selector syntax to reach the body.content tag and then the paragraph tag directly underneath it. My current attempt, which does not work, is:

// epochFileDoc is the name of the document with the code shown above.
Element tag_element = epochFileDoc.selectFirst("nitf|body.content > p");

I have tried a few other combinations of the selector syntax, including "nitf|content.body > p" and "nitf|body > p". None of them worked.

How would I use selector-syntax in Jsoup to get the paragraph tag shown above?

EDIT: I see why content.body does not work in the selector syntax, since the dot starts a class selector and it searches for a nitf:content tag with class="body", but I'm still lost on how to get that element.


Solution

  • The CSS selector that Jsoup uses fails here because a dot has a special meaning in CSS (like @Shlomi Fish said): it starts a class selector, so nitf|body.content is read as a nitf:body tag with the class content. In my code, I replaced instances of nitf:body.content with nitf:body-content using the line below, where file is the string in which the XML is stored:

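    // Rename the closing tag as well, so the renamed element stays balanced.
    file = file.replace("<nitf:body.", "<nitf:body-").replace("</nitf:body.", "</nitf:body-");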
    

    This allowed me to select the Element using:

    Element tag_element = epochFileDoc.selectFirst("nitf|body-content > p");
    

    It would be smarter to use a different parser for XML documents in cases like this, but if you have requirements like mine or simply want to keep Jsoup, this workaround does the job.
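
For readers who can switch parsers, here is a minimal sketch of the "different parser" route using the DOM parser that ships with the JDK. It is not the approach used in the answer above. The class name NitfParagraphs, the method firstParagraph, and the <doc> wrapper in the sample string are made up for illustration. It also assumes the input is well-formed XML, which the JDK parser requires and Jsoup does not. The factory is not namespace-aware by default, so nitf:body.content is treated as one ordinary tag name and the dot needs no workaround.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for illustration only.
class NitfParagraphs {

    /** Returns the trimmed text of the first <p> directly under <nitf:body.content>, or null if there is none. */
    static String firstParagraph(String xml) {
        try {
            // Not namespace-aware by default: "nitf:body.content" is a plain element name here.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList bodies = doc.getElementsByTagName("nitf:body.content");
            for (int i = 0; i < bodies.getLength(); i++) {
                // Look at direct children only, mirroring the "> p" child combinator.
                for (Node child = bodies.item(i).getFirstChild();
                        child != null;
                        child = child.getNextSibling()) {
                    if (child instanceof Element && "p".equals(child.getNodeName())) {
                        return child.getTextContent().trim();
                    }
                }
            }
            return null;
        } catch (Exception e) {
            // Malformed XML and I/O problems both end up here.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input; the <doc> root element is invented for this sketch.
        String xml = "<doc><p>Other paragraph.</p>"
                + "<nitf:body.content><p> Content would be here. </p></nitf:body.content></doc>";
        System.out.println(firstParagraph(xml));
    }
}
```

Running main on the sample string prints "Content would be here." and skips the unrelated paragraph outside nitf:body.content. A real file that declares the nitf namespace should parse the same way in this non-namespace-aware mode, but that is worth checking against your own data.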