xml, etl, talend

Split a single file containing multiple XML documents into separate XML files in Talend


There is a single compressed file that effectively contains multiple XML documents of the same format, one after another, so the file as a whole is not valid XML. For example, the large file has the following content:

<?xml version='1.0' encoding='UTF-8'?>
<Proposal xmlns="a namespace">
    <ASubnode>Text</ASubnode>
    <LotsOfOtherNodes />
</Proposal>
<?xml version='1.0' encoding='UTF-8'?>
<Proposal xmlns="a namespace">
    <ASubnode>Text</ASubnode>
    <LotsOfOtherNodes />
</Proposal>
....

I would like to process each of these documents one at a time, as a standalone XML document. For example:

<Proposal xmlns="a namespace">
    <ASubnode>Text</ASubnode>
    <LotsOfOtherNodes />
</Proposal>

Talend should read the block above as the first XML file, the next block as the second, and so on.

I cannot use tFileInputXML because it throws an exception when it reaches the intermediate XML declarations. How should I approach this problem?

Note: the example above is taken from a similar Stack Overflow question about Java.
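
To illustrate why a standard XML parser rejects the concatenated file, here is a minimal sketch using the JDK's own `DocumentBuilder`. This is not Talend's internal parser, and the class and method names (`ConcatenatedXmlCheck`, `isWellFormed`) are made up for the example. It only shows that a second XML declaration after the root element makes the input malformed, while a single block parses fine.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class ConcatenatedXmlCheck {

    /** Returns true if the string parses as one well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Raised for malformed input, e.g. a second <?xml ...?> declaration.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String single = "<?xml version='1.0' encoding='UTF-8'?>\n"
                + "<Proposal xmlns=\"a namespace\">\n"
                + "    <ASubnode>Text</ASubnode>\n"
                + "    <LotsOfOtherNodes />\n"
                + "</Proposal>\n";
        System.out.println("single block:     " + isWellFormed(single));
        System.out.println("two blocks fused: " + isWellFormed(single + single));
    }
}
```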


Solution

  • I suggest you split your multi-XML file into individual XML files, then read each individual file with tFileInputXML. Here's what I did to achieve that:

    [Image: screenshot of the Talend job]

    First, read the file with a tFileInputDelimited that has a single column (content), setting the row separator to "</Proposal>". Each row then holds the content of one XML document, minus its closing tag (which is consumed as the row separator).
    Then iterate over each row and feed it to a tFixedFlowInput that has 2 columns: xmlFile (set to content) and closingTag, which holds the closing tag that tFileInputDelimited removed. This is sent to a tFileOutputDelimited, which writes the XML content with the closing tag right after it (notice the empty Field separator).
    The file name is dynamic so that the files are numbered. The NB_FILE global variable is first set to 1 in tSetGlobalVar_1, then incremented in tSetGlobalVar_2 after each file is generated.
    Finally, use a tFileList with a mask like "out_*.xml" to iterate over the generated XML files and read each one with a tFileInputXML.

    [Image: screenshot of the Talend job]

    Here I just print the file paths to the console.
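
For readers who want to see the same logic outside the Talend designer, the steps above can be sketched in plain Java. This is an approximation under stated assumptions, not the code Talend generates. The class and method names are hypothetical, the whole input is assumed to fit in memory, and each chunk is trimmed so that the XML declaration sits at the very start of its output file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the Talend job described above.
class MultiXmlSplitter {
    static final String CLOSING_TAG = "</Proposal>";

    /** Splits on the closing tag (the "row separator") and restores it on each chunk. */
    static List<String> splitDocuments(String content) {
        List<String> docs = new ArrayList<>();
        for (String chunk : content.split(Pattern.quote(CLOSING_TAG))) {
            String trimmed = chunk.trim(); // assumption: surrounding whitespace is not significant
            if (!trimmed.isEmpty()) {
                docs.add(trimmed + "\n" + CLOSING_TAG);
            }
        }
        return docs;
    }

    /** Numbered file name, mirroring the NB_FILE counter. */
    static String fileName(int nbFile) {
        return "out_" + nbFile + ".xml";
    }

    /** Writes out_1.xml, out_2.xml, ... then lists files matching the out_*.xml mask. */
    static List<Path> writeAndList(List<String> docs, Path dir) throws IOException {
        int nbFile = 1; // set to 1 first, then incremented after each generated file
        for (String doc : docs) {
            Files.write(dir.resolve(fileName(nbFile)), doc.getBytes(StandardCharsets.UTF_8));
            nbFile++;
        }
        List<Path> found = new ArrayList<>();
        // Iteration order of a DirectoryStream is unspecified.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "out_*.xml")) {
            for (Path p : stream) {
                found.add(p);
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        String doc = "<?xml version='1.0' encoding='UTF-8'?>\n"
                + "<Proposal xmlns=\"a namespace\">\n"
                + "    <ASubnode>Text</ASubnode>\n"
                + "</Proposal>\n";
        Path dir = Files.createTempDirectory("split_demo");
        for (Path p : writeAndList(splitDocuments(doc + doc), dir)) {
            System.out.println(p); // print the generated file paths
        }
    }
}
```

Each generated file can then be handed to whatever reads a single XML document, which is the role tFileInputXML plays in the job.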

    Alternative solution

    Here's a more efficient implementation of the file-splitting part. It uses a tFileInputFullRow to read the file row by row (\n delimiter), then writes each row to a file (notice the append mode in tFileOutputDelimited_1). If the row that was just written is the XML closing tag, the file number is incremented so that the next row is written to a different file; otherwise the file number stays the same.

    [Image: screenshot of the Talend job]
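
The row-by-row alternative can be sketched in plain Java as well. As before, this is an illustration rather than Talend's generated code: `StreamingXmlSplitter`, `nextFileNumber` and `split` are hypothetical names, and the closing tag is assumed to sit on a line of its own, as in the sample. The input is streamed, so only one line is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for the row-by-row Talend job.
class StreamingXmlSplitter {
    static final String CLOSING_TAG = "</Proposal>";

    /** Increments the file number only when the line just written is the closing tag. */
    static int nextFileNumber(int current, String lineJustWritten) {
        return lineJustWritten.trim().equals(CLOSING_TAG) ? current + 1 : current;
    }

    /** Streams the input line by line; returns the number of closing tags seen. */
    static int split(Path input, Path outDir) throws IOException {
        int nbFile = 1;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Path target = outDir.resolve("out_" + nbFile + ".xml");
                // Append mode: every line of one document lands in the same file.
                try (BufferedWriter writer = Files.newBufferedWriter(target,
                        StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                    writer.write(line);
                    writer.newLine();
                }
                nbFile = nextFileNumber(nbFile, line);
            }
        }
        return nbFile - 1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: StreamingXmlSplitter <input file> <output dir>");
            return;
        }
        System.out.println(split(Paths.get(args[0]), Paths.get(args[1])) + " documents written");
    }
}
```

Reopening the output file for every line mirrors the append-mode component in the job. A standalone program could instead keep one writer open until the closing tag is reached.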