TL;DR
I'm using StAX to stream-parse XML files and extract some data.
When I reach an element that contains both text and another element and try to extract its text with xmlEventReader.getElementText(), the method throws an exception, because it expects the element to contain only text.
<div>Hello <i>World</i>!</div>
Because the div tag directly contains both text and an i tag, the text extraction fails.
I would like to be able to extract Hello World! from the above sample XML.
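A minimal sketch that should reproduce the failure on its own, assuming the JDK's built-in StAX implementation. The class and method names (MixedContentRepro, getElementTextFails) are made up for illustration and are not part of my app:

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;

class MixedContentRepro {

    /**
     * Hypothetical helper: returns true if getElementText() rejects the
     * first element of the given XML, false if it returns its text.
     */
    static boolean getElementTextFails(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        try {
            // Advance until the last event returned is the first START_ELEMENT,
            // which is the precondition for calling getElementText().
            while (reader.hasNext() && !reader.nextEvent().isStartElement()) {
                // keep skipping (e.g. the START_DOCUMENT event)
            }
            try {
                reader.getElementText();
                return false; // text-only element: no exception
            } catch (XMLStreamException mixedContent) {
                return true;  // a nested START_ELEMENT was encountered
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(getElementTextFails("<div>Hello World!</div>"));
        System.out.println(getElementTextFails("<div>Hello <i>World</i>!</div>"));
    }
}
```

The first call covers the text-only case, where getElementText() works; the second is the mixed-content case that fails for me.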
Whole Story
I'm writing a small Java app to export my recipes currently stored in Evernote and then import them into another application.
I'm just getting sick of Evernote's constant barrage of prompts to sign up for PRO.
I'm using StAX to stream-parse the XML files containing my notes, each of which holds a single recipe.
I'm able to export all of my notes from Evernote, and now I need to parse those notes to extract my recipe data. The ingredients and directions are stored in the body of the note as HTML markup within a CDATA section.
I'm basically parsing through all of the XML/HTML elements, and once I get to an li tag, I set a flag indicating that I'm within a list item and concatenate any text inside it, thus removing any HTML markup that was put in as formatting.
It's working really well, but I'm running into a small issue when an element contains both text and another element.
When I get to that parent element and call xmlEventReader.getElementText(), it throws an exception, as that method expects the element to contain only text.
Sample XML
I've kept this sample very simple: it contains only a few directions, with the ingredients logic removed.
<en-note>
    <ol>
        <li>
            <div>Add all ingredients to a </div>
            <div>ziplock freezer bag</div>
        </li>
        <li>
            <div>Freezer until needed (<i>maximum 2 months</i>).</div>
        </li>
    </ol>
</en-note>
Here is the code that parses the body of the note.
I've simplified this code to remove any logic pertaining to the ingredients as it's not relevant to the issue.
The above XML would be loaded into the recipeContent variable at the top of the code sample.
The code breaks on listItemValueSb.append(xmlEventReader.getElementText().trim()).append(" "); when it's parsing the 2nd direction, due to the i tag.
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class Test
{
    public static void main(final String[] args)
        throws XMLStreamException
    {
        final String recipeContent = """
            <en-note>
            <div><i>hello</i></div>
            <ol>
            <li>
            <div>Add all ingredients to a </div>
            <div>ziplock freezer bag</div>
            </li>
            <li>
            <div>Freezer until needed (<i>maximum 2 months</i>).</div>
            </li>
            </ol>
            </en-note>
            """;

        Test.parseRecipeContent(
            recipeContent,
            (final List<String> directions) -> {
                for (int n = 0; n < directions.size(); n++)
                {
                    System.out.println((n + 1) + ") " + directions.get(n));
                }
            }
        );
    }

    private static void parseRecipeContent(
        final String recipeContent,
        final RecipeContentHandler recipeContentHandler
    )
        throws XMLStreamException
    {
        final XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
        XMLEventReader xmlEventReader = null;

        try (
            final StringReader stringReader = new StringReader(recipeContent);
        )
        {
            xmlEventReader = xmlInputFactory.createXMLEventReader(stringReader);

            XMLEvent currentEvent;
            StartElement tmpStartElement;
            EndElement tmpEndElement;
            boolean withinLiTag = false;
            StringBuilder listItemValueSb = null;
            final List<String> directions = new ArrayList<>();

            while (xmlEventReader.hasNext())
            {
                currentEvent = xmlEventReader.nextEvent();

                // If the current event is an END-EVENT, then potentially end this recipe.
                if (currentEvent.isEndElement())
                {
                    tmpEndElement = currentEvent.asEndElement();

                    switch (tmpEndElement.getName().getLocalPart().toLowerCase())
                    {
                        case "en-note":
                            // Inform the calling code of the newly found recipe.
                            recipeContentHandler.handleRecipeContent(
                                directions
                            );
                            break;
                        case "li":
                            withinLiTag = false;

                            if (
                                (listItemValueSb != null)
                                &&
                                (listItemValueSb.length() > 0)
                            )
                            {
                                // Collapse runs of whitespace into a single space.
                                directions.add(listItemValueSb.toString().trim().replaceAll("\\s+", " "));
                            }
                            break;
                        default:
                            break;
                    }

                    continue;
                }

                // Any other non-START event: keep character data found inside <li>, then move on.
                if (!currentEvent.isStartElement())
                {
                    if (
                        withinLiTag
                        &&
                        currentEvent.isCharacters()
                    )
                    {
                        listItemValueSb.append(currentEvent.asCharacters().getData().trim()).append(" ");
                    }

                    continue;
                }

                // The current event is a START-EVENT, so extract the relevant data.
                tmpStartElement = currentEvent.asStartElement();

                switch (tmpStartElement.getName().getLocalPart().toLowerCase())
                {
                    case "en-note":
                        withinLiTag = false;
                        directions.clear();
                        break;
                    case "li":
                        withinLiTag = true;
                        listItemValueSb = new StringBuilder();
                        break;
                    default:
                        // final XMLEvent nextXMLEvent = xmlEventReader.peek();
                        //
                        // if (
                        //     (nextXMLEvent == null)
                        //     ||
                        //     !nextXMLEvent.isCharacters()
                        // )
                        // {
                        //     break;
                        // }
                        if (withinLiTag)
                        {
                            try
                            {
                                // This is the call that throws when the element contains a nested tag.
                                listItemValueSb.append(xmlEventReader.getElementText().trim()).append(" ");
                            }
                            catch (Throwable thrown)
                            {
                                thrown.printStackTrace();
                            }
                        }
                        break;
                }
            }
        }
        finally
        {
            if (xmlEventReader != null)
            {
                xmlEventReader.close();
            }
        }
    }

    /* PRIVATE INTERFACES */

    private static interface RecipeContentHandler
    {
        /**
         * This method will handle the specified {@link List} of
         * directions found.
         *
         * @param directions
         *     The {@link List} of directions.
         */
        void handleRecipeContent(List<String> directions);
    }
}
Your requirement seems to be that all text inside li is important regardless of the formatting. The use of xmlEventReader.peek() does not handle this case well.
Instead, it would be easier to grab all the text content under li in the branch where you currently skip each event and "continue", changing it like this:
// If the current event is not a START-EVENT, keep any text and move on.
if (!currentEvent.isStartElement()) {
    // ensures all character content inside <li> is recorded
    if (withinLiTag && currentEvent.isCharacters()) {
        listItemValueSb.append(currentEvent.asCharacters().getData().trim()).append(" ");
    }
    continue;
}
The above branch collects the trimmed inner text of li and makes the code in the "default" case of the last switch, which peeks at the stream, unnecessary, so just comment it out:
switch (tmpStartElement.getName().getLocalPart().toLowerCase()) {
    ...
    default:
        // comment out the default handling:
        // final XMLEvent nextXMLEvent = xmlEventReader.peek();
        // ...
        break;
}
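Putting it together, here is a minimal, self-contained sketch of the idea, not a drop-in replacement for your parser. The names ListItemText and extractListItems are made up for illustration. It also differs from the snippet above in one way that is my own assumption about what you want: it appends the raw character data and normalizes whitespace once per list item, instead of trimming each chunk, so no extra space ends up between "(" and the text of a nested tag. It assumes li elements are not nested inside one another.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class ListItemText {

    /** Returns the whitespace-normalized text of every li element, ignoring nested markup. */
    static List<String> extractListItems(String xml) throws XMLStreamException {
        List<String> items = new ArrayList<>();
        StringBuilder current = null; // non-null only while inside an <li>

        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();

                if (event.isStartElement()
                        && "li".equalsIgnoreCase(event.asStartElement().getName().getLocalPart())) {
                    // Entering a list item: start collecting text.
                    current = new StringBuilder();
                } else if (event.isEndElement()
                        && "li".equalsIgnoreCase(event.asEndElement().getName().getLocalPart())) {
                    // Leaving a list item: normalize whitespace and record it.
                    if (current != null) {
                        String text = current.toString().trim().replaceAll("\\s+", " ");
                        if (!text.isEmpty()) {
                            items.add(text);
                        }
                    }
                    current = null;
                } else if (current != null && event.isCharacters()) {
                    // Text at any depth below <li>; nested tags such as <i> are simply skipped.
                    current.append(event.asCharacters().getData());
                }
            }
        } finally {
            reader.close();
        }
        return items;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<en-note><ol>"
                + "<li><div>Add all ingredients to a </div><div>ziplock freezer bag</div></li>"
                + "<li><div>Freezer until needed (<i>maximum 2 months</i>).</div></li>"
                + "</ol></en-note>";

        // Prints:
        // Add all ingredients to a ziplock freezer bag
        // Freezer until needed (maximum 2 months).
        for (String item : extractListItems(xml)) {
            System.out.println(item);
        }
    }
}
```

Because getElementText() is never called, mixed content is no longer a special case: every Characters event seen between the opening and closing li tags is kept, whatever element it sits in.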