Is there a way to remove the comments from a huge XML file (>200 MB) parsed by vtd-xml?
I mean both comments before the root element
<!-- comment -->
<rootElement>
.
.
.
</rootElement>
and comments within
<rootElement>
<book>
<!-- comment -->
</book>
</rootElement>
The best solution would use XPath. I tried
//comment()
which works with DOM but not with vtd-xml.
Here is my code for selecting comments:
String xPath = "//comment()";
XMLModifier xm = new XMLModifier();
VTDGen vg = new VTDGen();
if (vg.parseFile(fnIn, true)) {
    VTDNav vn = vg.getNav();
    xm.bind(vn);
    nodeXpath(xPath, vn);
}
private void nodeXpath(String xPath, VTDNav vn) throws Exception {
    int result;
    AutoPilot ap = new AutoPilot();
    ap.selectXPath(xPath);
    ap.bind(vn);
    while ((result = ap.evalXPath()) != -1) {
        int p = vn.getText();
        if (p != -1) {
            System.out.println(vn.getText() + ", " + vn.toString(p));
        }
    }
}
But nothing is printed to the screen here.
Is there a way to do that with vtd-xml?
Thanks for your help.
You mentioned that your code prints nothing to the screen... not even commas? I wouldn't necessarily expect it to print anything from getText(), since the doc for getText() seems to indicate that it returns "the type character data or CDATA", which I don't think includes the content of a comment. (Thank you, @vtd-xml-author, for confirming that.)
A good test would be to print something in every iteration of your while loop before p = vn.getText(), so you'll know whether it's finding the comments at all.
If it is finding the comments, I think you'll want to call xm.removeToken(result) on each one.
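To make that concrete, here is a rough, untested sketch of the removal loop. It assumes a vtd-xml release whose XPath engine supports the comment() node test and returns the comment's token index from evalXPath(); I have not verified whether it also matches comments that sit before the root element. The class name CommentStripper and the helper stripComments are hypothetical names used only for illustration.

```java
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;
import com.ximpleware.XMLModifier;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class CommentStripper {

    // Hypothetical helper: parses the bytes, marks every comment token
    // found by //comment() for removal, and returns the modified document.
    static byte[] stripComments(byte[] xml) throws Exception {
        VTDGen vg = new VTDGen();
        vg.setDoc(xml);
        vg.parse(true); // true = namespace-aware, as in the question

        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("//comment()");
        XMLModifier xm = new XMLModifier(vn);

        int i;
        while ((i = ap.evalXPath()) != -1) {
            // Assumption: i is the index of the comment token itself.
            // The type check guards against removing anything else.
            if (vn.getTokenType(i) == VTDNav.TOKEN_COMMENT) {
                xm.removeToken(i);
            }
        }

        // Removals are only recorded above; they are applied on output.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        xm.output(out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "<rootElement><book><!-- comment --></book></rootElement>";
        byte[] result = stripComments(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

For the real file you would keep vg.parseFile(fnIn, true) as in your code and write the result with xm.output(fnOut), where fnOut is a hypothetical output path. The getTokenType check also doubles as the diagnostic suggested above: if the loop body never runs, the XPath is not finding the comments at all.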