
How do I keep track of parsing progress of large files in StAX?


I'm processing large (1TB) XML files using the StAX API. Let's assume we have a loop handling some elements:

XMLInputFactory fac = XMLInputFactory.newInstance();
XMLStreamReader reader = fac.createXMLStreamReader(new FileReader(inputFile));
// hasNext()/next() ends cleanly at END_DOCUMENT; nextTag() would throw on text content
while (reader.hasNext()) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        // handle contents
    }
}

How do I keep track of overall progress within the large XML file? Fetching the offset from reader works fine for smaller files:

int offset = reader.getLocation().getCharacterOffset();

but since getCharacterOffset() returns an int, it will probably only work up to about 2 GB (2^31 - 1 characters) before it overflows.


Solution

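  • A simple FilterReader should work: wrap the FileReader in it, pass the wrapper to createXMLStreamReader, and keep the running count in a long. Note that it counts characters, not bytes, and that the parser reads ahead in chunks, so the value is approximate.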

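    import java.io.FilterReader;
    import java.io.IOException;
    import java.io.Reader;

    class ProgressCounter extends FilterReader {
        // Characters consumed so far; a long, so it does not overflow at 2^31 - 1.
        private long progress = 0;

        public ProgressCounter(Reader in) {
            super(in);
        }

        @Override
        public long skip(long n) throws IOException {
            // Count what was actually skipped, which may be less than n.
            long skipped = super.skip(n);
            progress += skipped;
            return skipped;
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            int count = super.read(cbuf, off, len);
            // read returns -1 at end of stream; do not subtract it from the total.
            if (count > 0) {
                progress += count;
            }
            return count;
        }

        @Override
        public int read() throws IOException {
            // The return value is the character itself (or -1), not a count.
            int c = super.read();
            if (c != -1) {
                progress++;
            }
            return c;
        }

        public long getProgress() {
            return progress;
        }
    }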
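A file's length is reported in bytes, while a Reader counts characters, so the two only line up for single-byte encodings. One possible variant, sketched below, applies the same idea at the InputStream level so the count can be compared directly with the file size. The names `CountingInputStream`, `StaxProgressDemo` and `countElements` are hypothetical, chosen for illustration; the demo parses a small in-memory document so that it runs without a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: counts the bytes the parser has pulled from the stream.
class CountingInputStream extends FilterInputStream {
    private long count = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count++; // one byte consumed; -1 means end of stream
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    long getCount() {
        return count;
    }
}

class StaxProgressDemo {
    // Parses the whole stream, reporting progress at every start element.
    static int countElements(CountingInputStream counted, long totalBytes)
            throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(counted);
        int elements = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                elements++;
                // Approximate: the parser buffers, so it reads ahead of the current element.
                double percent = 100.0 * counted.getCount() / totalBytes;
                System.out.printf("%s: %.1f%% read%n", reader.getLocalName(), percent);
            }
        }
        reader.close();
        return elements;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a large file; illustrative data only.
        byte[] xml = "<root><item/><item/></root>".getBytes(StandardCharsets.UTF_8);
        CountingInputStream counted =
                new CountingInputStream(new ByteArrayInputStream(xml));
        int elements = countElements(counted, xml.length);
        System.out.println(elements + " elements, " + counted.getCount() + " bytes read");
    }
}
```

For a real file, the stream would presumably be a `FileInputStream` and the total would come from `File.length()`, which returns a long. Because the parser reads in chunks, the reported percentage runs slightly ahead of the element being handled, which is usually acceptable for a progress display on a very large file.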