Camel, split large XML file with header, using field condition


I'm trying to set up an Apache Camel route that reads a large XML file and splits the payload into two different files based on a field condition: if an ID field starts with a 1, the order goes to one output file, otherwise to the other. Using Camel is not a must, and I've looked at XSLT and plain Java options as well, but I feel that this should work.

I've got the splitting of the actual payload working, but I'm having trouble making sure that the parent nodes, including a header, are included in each file as well. As the file can be large, I want to make sure that streams are used for the payload. I feel like I've read hundreds of questions, blog entries, etc. on this, and pretty much every case covers either loading the entire file into memory, splitting the file equally into parts, or just using the payload nodes individually.

My prototype XML file looks like this:

<root>
    <header>
        <title>Testing</title>
    </header>
    <orders>
        <order>
            <id>11</id>
            <stuff>One</stuff>
        </order>
        <order>
            <id>20</id>
            <stuff>Two</stuff>
        </order>
        <order>
            <id>12</id>
            <stuff>Three</stuff>
        </order>
    </orders> 
</root>

The result should be two files - condition true (id starts with 1):

<root>
    <header>
        <title>Testing</title>
    </header>
    <orders>
        <order>
            <id>11</id>
            <stuff>One</stuff>
        </order>
        <order>
            <id>12</id>
            <stuff>Three</stuff>
        </order>
    </orders> 
</root>

Condition false:

<root>
    <header>
        <title>Testing</title>
    </header>
    <orders>
        <order>
            <id>20</id>
            <stuff>Two</stuff>
        </order>
    </orders> 
</root>

My prototype route:

from("file:" + inputFolder)
.log("Processing file ${headers.CamelFileName}")
.split()
    .tokenizeXML("order", "*") // Includes parent in every node
    .streaming()
    .choice()
        .when(body().contains("id>1"))
            .to("direct:ones")
            .stop()
        .otherwise()
            .to("direct:others")
            .stop()
    .end()
.end();

from("direct:ones")
//.aggregate(header("ones"), new StringAggregator()) // missing end condition
.to("file:" + outputFolder + "?fileName=ones-${in.header.CamelFileName}&fileExist=Append");

from("direct:others")
//.aggregate(header("others"), new StringAggregator()) // missing end condition
.to("file:" + outputFolder + "?fileName=others-${in.header.CamelFileName}&fileExist=Append");

This works as intended, except that the parent tags (header and footer, if you will) are added for every node. Using just the node name in tokenizeXML returns only the node itself, but I can't figure out how to add the header and footer. Preferably, I would stream the parent tags into a header and a footer property and add them before and after the split.

How can I do this? Would I somehow need to tokenize the parent tags first and would this mean streaming the file twice?

As a final note, you might notice the commented-out aggregate at the end. I don't want to aggregate every node before writing to the file, as that defeats the purpose of streaming and keeping the entire file out of memory, but I figured I might gain some performance by aggregating a number of nodes before each write, to lessen the performance hit of writing to the drive for every node. I'm not sure if this makes sense to do.
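
The batching idea in that last note can be sketched outside Camel. This is only a plain-Java illustration, not a Camel route: a BufferedWriter collects many small fragment writes in memory and passes them on in larger chunks, so memory use stays bounded by the buffer size rather than the file size. The class name, method name, and buffer size parameter are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original route.
final class BatchedFragmentWriter {

    /**
     * Writes each fragment followed by a newline. The BufferedWriter groups the
     * small writes and hands them to the file in chunks of up to bufferChars characters.
     */
    static void writeAll(Path target, Iterable<String> fragments, int bufferChars) throws IOException {
        try (Writer out = new BufferedWriter(
                new OutputStreamWriter(Files.newOutputStream(target), StandardCharsets.UTF_8),
                bufferChars)) {
            for (String fragment : fragments) {
                out.write(fragment);
                out.write("\n");
            }
        } // closing the writer flushes whatever is still buffered
    }
}
```

Whether this gives a measurable gain depends on the fragment size and the storage involved, so it is worth measuring before adding any aggregation step.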


Solution

  • I was unable to make it work with Camel. Or rather, once I had used plain Java to extract the header, I already had everything I needed to do the split, and swapping back to Camel seemed cumbersome. There are most likely ways to improve on this, but this was my solution to splitting the XML payload.

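    Switching between the two types of output (the XML event writers for the header and footer, the plain file writers for the payload) is not that pretty, but it eases the use of everything else. Also of note is that I chose equalsIgnoreCase to check the tag names even though XML is normally case sensitive. For me, it reduces the risk of errors. Finally, make sure your regex matches the entire payload string, using wildcards where needed, because the check uses String.matches. If a payload element spans several lines, the wildcard must also be able to match line breaks, e.g. by starting the regex with (?s).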
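
    The point about full-string matching can be checked in isolation. The following is a small sketch, assuming a payload string that contains line breaks between the tags (as it would for the prototype file above); the class name and the sample constant are made up for the demonstration.

    ```java
    import java.util.regex.Pattern;

    // Hypothetical demo class; ORDER mimics one captured payload element.
    final class SplitRegexDemo {

        static final String ORDER = "<order>\n    <id>11</id>\n    <stuff>One</stuff>\n</order>";

        public static void main(String[] args) {
            // String.matches only succeeds if the regex covers the whole string.
            System.out.println(ORDER.matches("<id>1"));          // false
            // '.' does not match line terminators by default, so this fails on multi-line input.
            System.out.println(ORDER.matches(".*<id>1.*"));      // false
            // The (?s) flag (DOTALL) lets '.' match line breaks as well.
            System.out.println(ORDER.matches("(?s).*<id>1.*"));  // true
            // Alternative: Matcher.find looks for the pattern anywhere in the input.
            System.out.println(Pattern.compile("<id>1").matcher(ORDER).find()); // true
        }
    }
    ```

    With the method below, a split condition such as "(?s).*<id>1.*" would therefore be a safer choice than ".*<id>1.*" for pretty-printed input.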

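    import java.io.FileInputStream;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.StringWriter;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLEventWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.events.EndElement;
    import javax.xml.stream.events.StartElement;
    import javax.xml.stream.events.XMLEvent;

    // Note: 'logger' used below is assumed to be a logger field of the enclosing class; it is not shown here.

    /**
     * Splits an XML file's payload into two new files based on a regex condition. The payload is a specific XML tag in the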
     * input file that is repeated a number of times. All tags before and after the payload are added to both files in order
     * to keep the same structure.
     * 
     * The content of each payload tag is compared to the regex condition and if true, it is added to the primary output file.
     * Otherwise it is added to the secondary output file. The payload can be empty and an empty payload tag will be added to
     * the secondary output file. Note that the output will not be an unaltered copy of the input as self-closing XML tags are
     * altered to corresponding opening and closing tags.
     * 
     * Data is streamed from the input file to the output files, keeping memory usage small even with large files.
     * 
     * @param inputFilename Path and filename for the input XML file
     * @param outputFilenamePrimary Path and filename for the primary output file
     * @param outputFilenameSecondary Path and filename for the secondary output file
     * @param payloadTag XML tag name of the payload
     * @param payloadParentTag XML tag name of the payload's direct parent
     * @param splitRegex The regex split condition used on the payload content
     * @throws Exception On invalid filenames, missing input, incorrect XML structure, etc.
     */
    public static void splitXMLPayload(String inputFilename, String outputFilenamePrimary, String outputFilenameSecondary, String payloadTag, String payloadParentTag, String splitRegex) throws Exception {
    
        XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
        XMLOutputFactory xmlOutputFactory = XMLOutputFactory.newInstance();
        XMLEventReader xmlEventReader = null;
        FileInputStream fileInputStream = null;
        FileWriter fileWriterPrimary = null;
        FileWriter fileWriterSecondary = null;
        XMLEventWriter xmlEventWriterSplitPrimary = null;
        XMLEventWriter xmlEventWriterSplitSecondary = null;
    
        try {
            fileInputStream = new FileInputStream(inputFilename);
            xmlEventReader = xmlInputFactory.createXMLEventReader(fileInputStream);
    
            fileWriterPrimary = new FileWriter(outputFilenamePrimary);
            fileWriterSecondary = new FileWriter(outputFilenameSecondary);
            xmlEventWriterSplitPrimary = xmlOutputFactory.createXMLEventWriter(fileWriterPrimary);
            xmlEventWriterSplitSecondary = xmlOutputFactory.createXMLEventWriter(fileWriterSecondary);
    
            boolean isStart = true;
            boolean isEnd = false;
            boolean lastSplitIsPrimary = true;
    
            while (xmlEventReader.hasNext()) {
                XMLEvent xmlEvent = xmlEventReader.nextEvent();
    
                // Check for start of payload element
                if (!isEnd && xmlEvent.isStartElement()) {
                    StartElement startElement = xmlEvent.asStartElement();
                    if (startElement.getName().getLocalPart().equalsIgnoreCase(payloadTag)) {
                        if (isStart) {
                            isStart = false;
                            // Flush the event writers as we'll use the file writers for the payload
                            xmlEventWriterSplitPrimary.flush();
                            xmlEventWriterSplitSecondary.flush();
                        }
    
                        String order = getTagAsString(xmlEventReader, xmlEvent, payloadTag, xmlOutputFactory);
                        if (order.matches(splitRegex)) {
                            lastSplitIsPrimary = true;
                            fileWriterPrimary.write(order);
                        } else {
                            lastSplitIsPrimary = false;
                            fileWriterSecondary.write(order);
                        }
                    }
                }
                // Check for end of parent tag
                else if (!isStart && !isEnd && xmlEvent.isEndElement()) {
                    EndElement endElement = xmlEvent.asEndElement();
                    if (endElement.getName().getLocalPart().equalsIgnoreCase(payloadParentTag)) {
                        isEnd = true;
                    }
                }
                // Neither a payload start nor the parent's end tag while inside the payload (most often white space)
                else if (!isStart && !isEnd) {
                    // Add to last split handled
                    if (lastSplitIsPrimary) {
                        xmlEventWriterSplitPrimary.add(xmlEvent);
                        xmlEventWriterSplitPrimary.flush();
                    } else {
                        xmlEventWriterSplitSecondary.add(xmlEvent);
                        xmlEventWriterSplitSecondary.flush();
                    }
                }
    
                // Start and end is added to both files
                if (isStart || isEnd) {
                    xmlEventWriterSplitPrimary.add(xmlEvent);
                    xmlEventWriterSplitSecondary.add(xmlEvent);
                }
            }
    
        } catch (Exception e) {
            logger.error("Error in XML split", e);
            throw e;
        } finally {
            // Close the streams
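            // Null checks avoid a NullPointerException if opening a stream failed earlier
            try {
                if (xmlEventReader != null) {
                    xmlEventReader.close();
                }
            } catch (XMLStreamException e) {
                // ignore
            }
            try {
                // Closing the reader does not close the underlying input stream
                if (fileInputStream != null) {
                    fileInputStream.close();
                }
            } catch (IOException e) {
                // ignore
            }
            try {
                if (xmlEventWriterSplitPrimary != null) {
                    xmlEventWriterSplitPrimary.close();
                }
            } catch (XMLStreamException e) {
                // ignore
            }
            try {
                if (xmlEventWriterSplitSecondary != null) {
                    xmlEventWriterSplitSecondary.close();
                }
            } catch (XMLStreamException e) {
                // ignore
            }
            try {
                if (fileWriterPrimary != null) {
                    fileWriterPrimary.close();
                }
            } catch (IOException e) {
                // ignore
            }
            try {
                if (fileWriterSecondary != null) {
                    fileWriterSecondary.close();
                }
            } catch (IOException e) {
                // ignore
            }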
        }
    }
    
    /**
     * Loops through the events in the {@code XMLEventReader} until the specific XML end tag is found and returns everything
     * contained within the XML tag as a String.
     * 
     * Data is streamed from the {@code XMLEventReader}, however the String can be large depending on the number of children
     * in the XML tag.
     * 
     * @param xmlEventReader The already active reader. The starting tag event is assumed to have already been read
     * @param startEvent The starting XML tag event already read from the {@code XMLEventReader}
     * @param tag The XML tag name used to find the starting XML tag
     * @param xmlOutputFactory Convenience include to avoid creating another factory
     * @return String containing everything between the starting and ending XML tag, the tags themselves included
     * @throws Exception On incorrect XML structure
     */
    private static String getTagAsString(XMLEventReader xmlEventReader, XMLEvent startEvent, String tag, XMLOutputFactory xmlOutputFactory) throws Exception {
        StringWriter stringWriter = new StringWriter();
        XMLEventWriter xmlEventWriter = xmlOutputFactory.createXMLEventWriter(stringWriter);
    
        // Add the start tag
        xmlEventWriter.add(startEvent);
    
        // Add until end tag
        while (xmlEventReader.hasNext()) {
            XMLEvent xmlEvent = xmlEventReader.nextEvent();
    
            // End tag found
            if (xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().getLocalPart().equalsIgnoreCase(tag)) {
                xmlEventWriter.add(xmlEvent);
                xmlEventWriter.close();
                stringWriter.close();
    
                return stringWriter.toString();
            } else {
                xmlEventWriter.add(xmlEvent);
            }
        }
    
        xmlEventWriter.close();
        stringWriter.close();
        throw new Exception("Invalid XML, no closing tag for <" + tag + "> found!");
    }
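
    As one possible variation on the approach above, the events of a single payload element can be buffered in a list instead of being turned into a string, so that only XMLEventWriters are used and no switching between writer types is needed. This is a minimal sketch, not a drop-in replacement: it works on in-memory strings for brevity, decides by an id prefix instead of a regex, and assumes payload elements are not nested inside each other. The class name, method name, and parameters are hypothetical.

    ```java
    import java.io.StringReader;
    import java.io.StringWriter;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLEventWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.events.XMLEvent;

    // Hypothetical sketch; for real files the StringReader/StringWriter would be file streams.
    final class EventBufferSplit {

        /** Returns { primaryXml, secondaryXml }. Payloads whose id text starts with idPrefix go to primary. */
        static String[] split(String xml, String payloadTag, String idTag, String idPrefix)
                throws XMLStreamException {
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            // Ask the parser to deliver adjacent text as a single Characters event
            inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(xml));

            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            StringWriter primaryOut = new StringWriter();
            StringWriter secondaryOut = new StringWriter();
            XMLEventWriter primary = outputFactory.createXMLEventWriter(primaryOut);
            XMLEventWriter secondary = outputFactory.createXMLEventWriter(secondaryOut);

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (isStart(event, payloadTag)) {
                    // Buffer only the events of this one payload element
                    List<XMLEvent> buffer = new ArrayList<>();
                    StringBuilder id = new StringBuilder();
                    boolean inId = false;
                    buffer.add(event);
                    while (reader.hasNext()) {
                        XMLEvent inner = reader.nextEvent();
                        buffer.add(inner);
                        if (isStart(inner, idTag)) {
                            inId = true;
                        } else if (isEnd(inner, idTag)) {
                            inId = false;
                        } else if (inId && inner.isCharacters()) {
                            id.append(inner.asCharacters().getData());
                        } else if (isEnd(inner, payloadTag)) {
                            break;
                        }
                    }
                    XMLEventWriter target = id.toString().trim().startsWith(idPrefix) ? primary : secondary;
                    for (XMLEvent buffered : buffer) {
                        target.add(buffered);
                    }
                } else {
                    // Everything outside a payload element (header, parent tags, footer) goes to both outputs
                    primary.add(event);
                    secondary.add(event);
                }
            }
            reader.close();
            primary.close();
            secondary.close();
            return new String[] { primaryOut.toString(), secondaryOut.toString() };
        }

        private static boolean isStart(XMLEvent event, String tag) {
            return event.isStartElement() && event.asStartElement().getName().getLocalPart().equals(tag);
        }

        private static boolean isEnd(XMLEvent event, String tag) {
            return event.isEndElement() && event.asEndElement().getName().getLocalPart().equals(tag);
        }
    }
    ```

    Memory use is still bounded by the size of one payload element, as in the string-based version. One difference is that white space between payload elements ends up in both outputs rather than only in the last one written.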