parsing, unix, sax

XML parser behaves differently on a Unix machine for huge XML files only; the same code works fine on Windows. Why?


Issue: I am facing a problem with XML parsing (SAX parser) on a Unix machine. The same JAR and Java code behaves differently on Windows and on Unix. Why?

Windows machine: works fine. The SAX parser loads the huge XML file, reads all values correctly, and populates them. Charset.defaultCharset() returns windows-1252.

Unix machine: I built the JAR, deployed it to Tomcat on Unix, and ran it against the same huge XML file. Some values come out empty or incomplete. For example, the country name is populated as "ysia" instead of "Malaysia", and the transaction date as "3 PM" instead of "18/09/2016 03:31:23 PM". Charset.defaultCharset() returns UTF-8.

The issue occurs only on Unix. When I load the same XML on Windows or in my local Eclipse, it works fine and all values are populated correctly.

I also tried modifying my code to set the encoding to UTF-8 on the InputStreamReader, but the values are still not read correctly on the Unix box.

Note: there are no special characters in the XML. I also noticed that when I copy the affected records (the ones whose values are not populated correctly) into a separate XML file and load it on the Unix machine with the same JAR, it works fine. So the issue occurs only when these records are loaded as part of the huge file.

Setup Code:

SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
try {
  SAXParser saxParser = saxParserFactory.newSAXParser();
  InputStream inputStream= new FileInputStream(inputFilePath);
  Reader reader = new InputStreamReader(inputStream,"UTF-8");
  InputSource is = new InputSource(reader); 
  is.setEncoding("UTF-8"); 
  saxParser.parse(is,(DefaultHandler) handler); 
} catch(Exception ex){ 
  ex.printStackTrace();
}  

Handlers:

public void characters(char[] ac, int i, int j) throws SAXException { 
  chars.append(ac, i, j); 
  tmpValue = new String(ac, i, j).trim(); 
}


public void endElement(String s, String s1, String element) throws SAXException {
  if (element.equalsIgnoreCase("transactionDate")) {          
    obj.setTransactionDate(tmpValue); 
  }
}

Please suggest what the solution should be.


Solution

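  • If the parser's current read buffer ends in the middle of an element's text, you may get two (or more) calls to characters() for the same element (for instance one with "Mala" and one with "ysia") instead of a single call with "Malaysia". SAX allows the parser to deliver contiguous text in several chunks. Your code assigns tmpValue on every call, so the second call overwrites "Mala" with "ysia". This would also explain why only the huge file is affected: a small file likely fits in one buffer, and where the buffer boundaries fall depends on the exact input and parser, so the same record can be split on one machine and not on another. To address this, accumulate the content of all calls to characters() for an element: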

    public void startElement(String uri, String localName, String qName, 
        Attributes attributes) throws SAXException {
      if(qName.equalsIgnoreCase("customerName")){ 
        chars.setLength(0); 
      }
      // Reset the accumulated text at the start of every element.
      tmpValue = null;
    } 
    
    // May be called several times for one element's text.
    public void characters(char[] ac, int i, int j) throws SAXException {
      chars.append(ac, i, j);
      if (tmpValue == null) {
        // First chunk of this element.
        tmpValue = new String(ac, i, j);
      } else {
        // Later chunk: append instead of overwriting.
        tmpValue += new String(ac, i, j);
      }
    }
    
    public void endElement(String s, String s1, String element) throws SAXException {
      // Trim only here, once the full text has been collected.
      if (element.equalsIgnoreCase("transactionDate") && tmpValue != null) {          
        obj.setTransactionDate(tmpValue.trim()); 
      }
    }
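
For reference, the same accumulation idea can be written with a single StringBuilder instead of string concatenation. The following is only a minimal, self-contained sketch, not the asker's actual handler: the names AccumulatingHandlerDemo, DateHandler and parseDates are hypothetical, and it assumes the goal is simply to collect the text of each transactionDate element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AccumulatingHandlerDemo {

    // Hypothetical handler: collects the text of every <transactionDate> element.
    static class DateHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        final List<String> dates = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            text.setLength(0); // every element starts with an empty buffer
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may call this several times for one element,
            // so append rather than assign.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equalsIgnoreCase("transactionDate")) {
                // Trim once, after all chunks have been collected.
                dates.add(text.toString().trim());
            }
            text.setLength(0);
        }
    }

    // Parses the given XML string and returns the collected dates.
    static List<String> parseDates(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DateHandler handler = new DateHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.dates;
    }

    public static void main(String[] args) throws Exception {
        // Simulate a parser that splits one element's text across two calls.
        DateHandler handler = new DateHandler();
        char[] first = "18/09/2016 03:31:2".toCharArray();
        char[] second = "3 PM".toCharArray();
        handler.startElement("", "", "transactionDate", null);
        handler.characters(first, 0, first.length);
        handler.characters(second, 0, second.length);
        handler.endElement("", "", "transactionDate");
        System.out.println(handler.dates);
    }
}
```

The main method feeds the handler two chunks by hand to mimic a buffer boundary, so the effect can be checked without a huge file. With the original assign-on-every-call handler, the same two calls would leave only the second chunk.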