ruby-on-rails ruby xml-parsing rexml

XML parsing in Ruby


I am using the REXML parser in Ruby to parse an XML file. On a 64-bit AIX box with 64-bit Ruby, I get the following error:

REXML::ParseException: #<REXML::ParseException: #<RegexpError: Stack overflow in 
regexp matcher: 
/^<((?>(?:[\w:][\-\w\d.]*:)?[\w:][\-\w\d.]*))\s*((?>\s+(?:[\w:][\-\w\d.]*:)?[\w:][\-\w\d.]*\s*=\s*(["']).*?\3)*)\s*(\/)?>/mu>

The call that triggers it looks something like this:

REXML::Document.new(File.open(actual_file_name, "r"))
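To see the failure in isolation, the call can be wrapped as in the minimal sketch below. The helper name `parse_xml_file` and the sample XML are made up for illustration. The block form of `File.open` closes the file handle, and rescuing `REXML::ParseException` shows the wrapped error message instead of aborting the script.

```ruby
require "rexml/document"
require "tempfile"

# Hypothetical helper (not part of REXML): parse the file at `path` and
# return the REXML::Document, or nil after printing the parse error.
def parse_xml_file(path)
  # The block form closes the file once parsing has finished.
  File.open(path, "r") do |file|
    REXML::Document.new(file)
  end
rescue REXML::ParseException => e
  warn "REXML could not parse #{path}: #{e.message}"
  nil
end

# Usage with a throwaway file standing in for actual_file_name.
Tempfile.create(["sample", ".xml"]) do |tmp|
  tmp.write('<root><item id="1">hello</item></root>')
  tmp.flush
  doc = parse_xml_file(tmp.path)
  puts doc.root.elements["item"].text if doc
end
```

If this small input parses cleanly on the AIX box while the real file does not, the input that triggers the error can be narrowed down from there.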

Does anyone have an idea regarding how to solve this issue?


Solution

  • I almost immediately found the answer.

    The first thing I did was search the Ruby source code for the error being thrown, and I found that regex.h was responsible.

    In regex.h, the relevant definition looks like this:

    /* Maximum number of duplicates an interval can allow.  */
    #ifndef RE_DUP_MAX
    #define RE_DUP_MAX  ((1 << 15) - 1)
    #endif
    

    The problem here is RE_DUP_MAX. On the AIX box, the same constant is already defined by system headers under /usr/include. I searched for it and found it in:

    /usr/include/NLregexp.h
    /usr/include/sys/limits.h
    /usr/include/unistd.h
    

    I am not sure which of the three is being used (most probably NLregexp.h). In these headers, RE_DUP_MAX is set to 255, so the number of repetitions a regex can use is capped at 255.

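    In short, the compilation picks up the system-defined value rather than the one defined in regex.h. Because that definition is wrapped in #ifndef RE_DUP_MAX, it is skipped whenever a system header included earlier has already defined the constant.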

    This also answers a question I asked recently: Regex limit in ruby 64 bit aix compilation

    I was not able to answer it immediately, as I need a minimum of 100 reputation :D :D Cheers!
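
A quick way to check whether a given Ruby build has a low interval cap like the one described in the answer is to compile patterns with increasing repetition counts. The sketch below is illustrative only: the helper name `interval_accepted?` is made up, and it assumes the build rejects an oversized count with a `RegexpError` when the pattern is compiled. It checks only the interval cap and does not reproduce the stack overflow itself.

```ruby
# Hypothetical helper: true if this Ruby build accepts an interval
# expression with `count` repetitions, false if compiling the regexp
# raises RegexpError.
def interval_accepted?(count)
  Regexp.new("a{#{count}}")
  true
rescue RegexpError
  false
end

# 255 is the value found in the AIX headers; (1 << 15) - 1 is the value
# regex.h would define if no system header had defined it first.
[255, 256, (1 << 15) - 1].each do |count|
  status = interval_accepted?(count) ? "accepted" : "rejected"
  puts "a{#{count}}: #{status}"
end
```

If 256 is rejected while 255 is accepted, the build is using the system-defined value of 255.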