Tags: java, xml, saxparser, stax, vtd-xml

Fast way of retrieving data from XML


I have a sample XML document:

<?xml version="1.0" encoding="UTF-8"?>
  <tag_1>
     <tag_2>A</tag_2>
     <tag_3>B</tag_3>
     <tag_4>C</tag_4>
     <tag_5>D</tag_5>
  </tag_1>

I am interested in extracting only specific data.

For example:

tag_1/tag_5 -> D

tag_1/tag_5 is my data definition (the only data I want). It is dynamic: tomorrow the data definition may be tag_1/tag_4 instead.

In reality, each XML document is a large data set, and these payloads arrive at a rate of 50,000 to 80,000 per hour.

I would like to know whether there is already a high-performance XML reader, or some special logic I can implement, that extracts data according to the data definition.

Currently I have an implementation using a StAX parser, but it takes nearly a day to parse 80,000 XML documents.

import com.ximpleware.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class VTDParser {

    private final Logger LOG = LoggerFactory.getLogger(VTDParser.class);

    private final VTDGen vg;

    public VTDParser() {
        vg = new VTDGen();
    }

    public String parse(final String data, final String xpath) {
        vg.setDoc(data.getBytes());
        try {
            vg.parse(true);
        } catch (final ParseException e) {
            LOG.error(e.toString());
        }

        final VTDNav vn = vg.getNav();
        final AutoPilot ap = new AutoPilot(vn);
        try {
            ap.selectXPath(xpath);
        } catch (final XPathParseException e) {
            LOG.error(e.toString());
        }

        try {
            while (ap.evalXPath() != -1) {
                final int val = vn.getText();
                if (val != -1) {
                    return vn.toNormalizedString(val);
                }
            }
        } catch (XPathEvalException | NavException e) {
            LOG.error(e.toString());
        }
        return null;
    }
}

Solution

  • This is my modification of your code. It compiles the XPath once, without binding it to a VTDNav instance, and reuses it for every document. It also calls resetXPath before leaving the parse method. I have not shown how to pre-index the XML documents with VTD-XML to avoid repeated parsing, though I suspect that could be the difference-maker for your project. Here is a paper on the capabilities of VTD-XML:

    http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf

    import com.ximpleware.*;
    import java.nio.charset.StandardCharsets;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class VTDParser {

        private static final Logger LOG = LoggerFactory.getLogger(VTDParser.class);

        private final VTDGen vg;
        private final AutoPilot ap;

        // xpath is the data definition, e.g. "/tag_1/tag_5"
        public VTDParser(final String xpath) throws VTDException {
            vg = new VTDGen();
            ap = new AutoPilot();
            // Compile the XPath once, without binding it to an XML document.
            ap.selectXPath(xpath);
        }

        public String parse(final String data) {
            vg.setDoc(data.getBytes(StandardCharsets.UTF_8));
            try {
                vg.parse(true);
            } catch (final ParseException e) {
                LOG.error(e.toString());
                return null; // no usable document to navigate
            }

            final VTDNav vn = vg.getNav();
            ap.bind(vn); // attach the precompiled XPath to this document
            try {
                while (ap.evalXPath() != -1) {
                    final int val = vn.getText();
                    if (val != -1) {
                        return vn.toNormalizedString(val);
                    }
                }
            } catch (XPathEvalException | NavException e) {
                LOG.error(e.toString());
            } finally {
                ap.resetXPath(); // reset on every exit path so the XPath can be reused
            }
            return null;
        }
    }
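For comparison with the StAX approach mentioned in the question, here is a minimal JDK-only sketch of the same idea: split the data definition once, reuse the parser factory, and stop reading as soon as the first match is found. It is not part of the answer above and makes no performance claim. The class name StaxPathExtractor is hypothetical, and it assumes the data definition is a simple slash-separated element path from the root (such as tag_1/tag_5) whose target element contains only text.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: extracts the text of the first element matching a
// slash-separated path from the root, e.g. "tag_1/tag_5".
final class StaxPathExtractor {

    // Creating a factory is comparatively costly, so keep one per extractor.
    // Factories are not guaranteed thread-safe; use one extractor per thread.
    private final XMLInputFactory factory = XMLInputFactory.newInstance();

    // The path is split once, at construction, and reused for every document.
    private final String[] steps;

    StaxPathExtractor(final String path) {
        this.steps = path.split("/");
    }

    // Returns the element text, or null if the path does not occur.
    String extract(final String xml) {
        XMLStreamReader reader = null;
        try {
            reader = factory.createXMLStreamReader(new StringReader(xml));
            int depth = 0;   // number of currently open elements
            int matched = 0; // how many leading path steps the open elements match
            while (reader.hasNext()) {
                final int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (matched == depth && depth < steps.length
                            && steps[depth].equals(reader.getLocalName())) {
                        matched++;
                        if (matched == steps.length) {
                            // Early exit: the rest of the document is never read.
                            // getElementText() fails on elements with child elements.
                            return reader.getElementText();
                        }
                    }
                    depth++;
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                    if (matched > depth) {
                        matched = depth; // left a matching ancestor
                    }
                }
            }
            return null;
        } catch (final XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (final XMLStreamException ignored) {
                    // nothing useful to do if closing fails
                }
            }
        }
    }
}
```

With the sample document from the question (without the stray closing tag), new StaxPathExtractor("tag_1/tag_5").extract(xml) returns D. Changing the data definition only means constructing a new extractor with a different path.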