xml, xml-parsing, apache-nifi

Extract attributes from XML in NiFi


I have XML files that I fetch from FTP (with the ListFTP and FetchFTP processors). I want to extract the values from each XML file and replace the file's content with those values, as if it were a CSV, and then put the result back on the FTP server with the PutFTP processor.

The desired output is something like this:

{"foodate":"somedate","name":"fooid1_foovalue","value":5.44}
{"foodate":"somedate","name":"fooid1_metrics","value":some-metrics}
.
.
.
{"foodate":"somedate","name":"fooid2_foovalue","value":2.34}
.
.
.

So for each id, write the foodate attribute first, followed by the id combined with each remaining Sample attribute: id1 with sample attribute 1, id1 with sample attribute 2, and so on.

However, I don't know in advance what the attributes will be named or how many there will be, only that the first Sample attribute will be foodate. Any idea how to proceed? I tried the ExecuteScript processor with JavaScript, but it does not seem to recognize DOMParser() and the like.

<?xml version="1.0" encoding="ISO-8859-1"?>
<Document Version="2">
    <ExportData lowerBound="2021/11/24 16:58:26" upperBound="2021/11/24 22:58:26">
        <Site name="name" f="">
            <Kapta fooid1="some-number">
                <Infos>
                    <Info>
                        <EndPoint foo="value-name" />
                    </Info>
                </Infos>
                <Samples ordering="desc">
                    <Sample foodate="some-date" foovalue="5.44" metrics="some-metrics" metrics2="metrics-again" value="numbers5" te="numbers" />
                    <Sample foodate="some-date" foovalue="7.45" foom="some-metrics" metrics453="metrics-again" otherattribut="numbers5" att345="numbers" morevalues="numbers" foohdeiurf="numbers" hello="numbers"/>
                </Samples>
            </Kapta>
            <Kapta fooid2="some-number">
                <Infos>
                    <Info>
                        <EndPoint foo="value-name" />
                    </Info>
                </Infos>
                <Samples ordering="desc">
                    <Sample foodate="some-date" foovalue="2.34" metrics="some-metrics" metrics2="metrics-again" value="numbers" te="numbersagain" />
                    <Sample foodate="some-date" foo="99.8" metrics="some-metrics" metrics2="metrics-again" value="numbers" te="numbers" />
                    <Sample foodate="some-date" attr="234.56" someothermetrics="some-metrics" metr="metrics-again" anothervalue="numbers" />
                </Samples>
            </Kapta>
        </Site>
    </ExportData>
</Document>

Thanks a lot for your time and effort!


Solution

  • You can use Groovy's XML parsing libraries. There are several options, depending on your needs.

    Here is some experimental code. It reads the XML from the content of the incoming flow file and writes some extracted values as a list of JSON objects, one per line. You can adapt it to your requirements.

    Please note that this code may not be production grade. See the ExecuteScript cookbook for more about Groovy in NiFi.

    import org.apache.nifi.processor.io.InputStreamCallback
    import org.apache.nifi.processor.io.StreamCallback
    import java.nio.charset.StandardCharsets
    import javax.xml.parsers.DocumentBuilder
    import javax.xml.parsers.DocumentBuilderFactory
    import org.w3c.dom.Document
    import groovy.xml.dom.DOMCategory
    import groovy.json.JsonGenerator
    
    def flowFile = session.get()
    // No flow file is queued for this run, so there is nothing to do
    if (flowFile == null) return
    
    try {
        
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        Document doc = null
    
        session.read(flowFile, {inputStream ->
            doc =  dBuilder.parse(inputStream)
        } as InputStreamCallback)
        
        def root = doc.documentElement
        def sb = new StringBuilder()
        def jsonGenerator = new JsonGenerator.Options().disableUnicodeEscaping().build()
        
        // Get a specific attribute (foo) from each child of Info, e.g. EndPoint
        use(DOMCategory) {
            root['ExportData']['Site']['Kapta']['Infos']['Info']['*'].each { node ->
                def data = new LinkedHashMap()
                data.NodeName = node.name()
                // Key named after the attribute actually read ('foo', not 'foodate')
                data.foo = node['@foo']
                sb.append(jsonGenerator.toJson(data))
                sb.append('\n')
            }
        }
        
        // Get all attributes of each Sample under Samples,
        // writing one JSON line per attribute
        use(DOMCategory) {
            root['ExportData']['Site']['Kapta']['Samples']['*'].each { node ->
                def data = new LinkedHashMap()
                data.NodeName = node.name()
                def attributesMap = node.attributes()
                for (int x = 0; x < attributesMap.getLength(); x++) {
                    data.AttrName = attributesMap.item(x).getNodeName()
                    data.AttrValue = attributesMap.item(x).getNodeValue()
                    sb.append(jsonGenerator.toJson(data))
                    sb.append('\n')
                }
            }
        }
        
        flowFile = session.write(flowFile, {inputStream, outputStream ->
            outputStream.write(sb.toString().getBytes(StandardCharsets.UTF_8))
        } as StreamCallback)
        
        session.transfer(flowFile, REL_SUCCESS)
        
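    } catch (Exception e) {
        log.error('Failed to extract attributes from XML', e)
        // Route to failure only if a flow file was actually obtained
        if (flowFile != null) {
            session.transfer(flowFile, REL_FAILURE)
        }
    }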
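
To get closer to the output format asked for in the question, the attribute loop could be adapted roughly as follows. This is only a minimal sketch under a few assumptions: the function name `samplesToJsonLines` is hypothetical, the id is taken to be the name of the first attribute on each Kapta element (for example `fooid1`), and values that parse as decimals are written as JSON numbers while everything else is written as a string. DOM does not guarantee attribute order, so the lines for one Sample may come out in a different order than the attributes appear in the file.

```groovy
import groovy.json.JsonOutput
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper: turns the export XML into one JSON object per line,
// one line for every Sample attribute other than foodate.
String samplesToJsonLines(InputStream xml) {
    def doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml)
    def lines = []
    def kaptas = doc.getElementsByTagName('Kapta')
    for (int k = 0; k < kaptas.getLength(); k++) {
        Element kapta = (Element) kaptas.item(k)
        // Assumption: the first attribute name on <Kapta> is the id
        if (kapta.getAttributes().getLength() == 0) continue
        String id = kapta.getAttributes().item(0).getNodeName()
        def samples = kapta.getElementsByTagName('Sample')
        for (int s = 0; s < samples.getLength(); s++) {
            Element sample = (Element) samples.item(s)
            String date = sample.getAttribute('foodate')
            def attrs = sample.getAttributes()
            for (int a = 0; a < attrs.getLength(); a++) {
                String name = attrs.item(a).getNodeName()
                if (name == 'foodate') continue   // the date is repeated on every line instead
                String raw = attrs.item(a).getNodeValue()
                // Assumption: decimal-looking values become JSON numbers
                def value = raw.isBigDecimal() ? raw.toBigDecimal() : raw
                lines << JsonOutput.toJson([foodate: date, name: id + '_' + name, value: value])
            }
        }
    }
    return lines.join('\n')
}

// Inside ExecuteScript, this could stand in for the StringBuilder logic, e.g.:
// flowFile = session.write(flowFile, { inputStream, outputStream ->
//     outputStream.write(samplesToJsonLines(inputStream).getBytes('UTF-8'))
// } as StreamCallback)
```

Because the function only needs an InputStream, it can be tried outside NiFi (for example in the Groovy console) against a sample file before it is wired into the processor.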