I have a single (compressed) file that effectively contains multiple XML documents of the same format, one after another, so the file as a whole is not valid XML. For example, the large file has the following content:
<?xml version='1.0' encoding='UTF-8'?>
<Proposal xmlns="a namespace">
<ASubnode>Text</ASubnode>
<LotsOfOtherNodes />
</Proposal>
<?xml version='1.0' encoding='UTF-8'?>
<Proposal xmlns="a namespace">
<ASubnode>Text</ASubnode>
<LotsOfOtherNodes />
</Proposal>
....
I would like to process each Proposal node one at a time, as a standalone XML document. For example:
<Proposal xmlns="a namespace">
<ASubnode>Text</ASubnode>
<LotsOfOtherNodes />
</Proposal>
Talend should read the block above as the first XML file, the next block as the second, and so on.
I cannot use tFileInputXML because it throws an exception when it reaches the intermediate XML declarations. Could you please suggest how to approach this problem?
Note: I have reused the example from a similar Stack Overflow question about Java.
I suggest you split your multi-XML file into individual XML files, then read each individual file with tFileInputXML. Here's what I did to achieve that:
First, read the file with a tFileInputDelimited that has a single column (content), setting the row separator to "</Proposal>". This populates the content column with the content of a single XML document (without the closing tag, since it is used as the row separator).
Then iterate over each row and pass it to a tFixedFlowInput that has two columns: xmlFile (set to content) and closingTag, which holds the closing tag that tFileInputDelimited removed. This is then sent to a tFileOutputDelimited, which writes the XML content with the closing tag right after it (notice the empty field separator).
The file name is dynamic so that the generated files are numbered. The NB_FILE global variable is first set to 1 in tSetGlobalVar_1, then incremented after each generated file in tSetGlobalVar_2.
At the end, you can simply use a tFileList with a mask like "out_*.xml" to iterate over the generated XML files and read each one with a tFileInputXML.
Here I just print the file paths to console.
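If it helps to see the idea outside of the Talend designer, here is a minimal plain-Java sketch of the same split-on-closing-tag approach. This is not the code Talend generates: the class name ClosingTagSplitter and its split method are hypothetical, and the sketch assumes the input is already decompressed, is UTF-8, and that the closing tag never appears inside text content or CDATA. It uses a Scanner with the closing tag as delimiter (the equivalent of the row separator) and appends the tag back when writing each numbered file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ClosingTagSplitter {

    // Splits `input` on `closingTag` and writes out_1.xml, out_2.xml, ... into `outDir`.
    // Returns the number of files written.
    static int split(Path input, Path outDir, String closingTag) throws IOException {
        int fileNo = 0;
        try (Scanner scanner = new Scanner(input, StandardCharsets.UTF_8.name())) {
            // The closing tag plays the role of the row separator; quote it so it
            // is matched literally rather than as a regular expression.
            scanner.useDelimiter(Pattern.quote(closingTag));
            while (scanner.hasNext()) {
                // trim() drops the newline left over from the previous document, so
                // the XML declaration ends up at the very start of the new file.
                String content = scanner.next().trim();
                if (content.isEmpty()) {
                    continue; // only whitespace after the last closing tag
                }
                fileNo++;
                Path out = outDir.resolve("out_" + fileNo + ".xml");
                // Put back the closing tag that the delimiter consumed.
                String xml = content + closingTag + System.lineSeparator();
                Files.write(out, xml.getBytes(StandardCharsets.UTF_8));
            }
        }
        return fileNo;
    }
}
```

Calling ClosingTagSplitter.split(inputPath, outputDir, "</Proposal>") would then produce the same kind of numbered out_*.xml files that the tFileList step picks up. Note that each document is held in memory as a single string, just like a row in the content column.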
Alternative solution
Here's a more efficient implementation of the file split part. It uses a tFileInputFullRow to read the file row by row (\n delimiter), then writes each row to a file (notice the append mode in tFileOutputDelimited_1). If the line that was just written is the XML closing tag, the file number is incremented so that the next row is written to a different file; otherwise the file number stays the same.
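For comparison, the row-by-row logic can be sketched in plain Java as follows. Again, this is only an illustration under assumptions: the class name LineByLineSplitter is hypothetical, the input is assumed to be decompressed UTF-8, and the closing tag is assumed to sit on a line of its own, as in the sample. One deliberate difference from the job: instead of reopening the output file in append mode for every row, the sketch keeps one writer open per output file and closes it when the closing tag is seen.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
class LineByLineSplitter {

    // Streams `input` line by line into out_1.xml, out_2.xml, ... inside `outDir`.
    // Returns the number of files written.
    static int split(Path input, Path outDir, String closingTag) throws IOException {
        int fileNo = 1;   // same role as the NB_FILE counter
        int written = 0;
        BufferedWriter writer = null;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (writer == null) {
                    if (line.trim().isEmpty()) {
                        continue; // ignore blank lines between documents
                    }
                    Path out = outDir.resolve("out_" + fileNo + ".xml");
                    writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8);
                    written++;
                }
                writer.write(line);
                writer.newLine();
                // The line just written was the closing tag: finish this file and
                // move on to the next file number.
                if (line.trim().equals(closingTag)) {
                    writer.close();
                    writer = null;
                    fileNo++;
                }
            }
        } finally {
            if (writer != null) {
                writer.close(); // input ended without a final closing tag
            }
        }
        return written;
    }
}
```

Because only one line is held in memory at a time, this variant should cope better with very large documents than splitting on the closing tag as a row separator.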