StackOverflowError when splitting a string using regex


I'm doing a project in MapReduce using Amazon Web Services and I'm getting this error:

FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.StackOverflowError at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)

I read a few other questions to understand why this happened and it seems my regex has repetitive alternative paths. This is the regex:

\\s+(?=(?:(?<=[a-zA-Z])\"(?=[A-Za-z])|\"[^\"]*\"|[^\"])*$)

It splits on whitespace, except when the whitespace is inside < > or " ". In other words, strings enclosed in either of those two kinds of delimiters should stay intact. I have tried many other versions but none of them works, so I am far from an optimal one. I am kind of lost, and it's the first time I'm using regexes this complicated. Can someone please suggest a better alternative to my regex?

I would truly appreciate any feedback on this!

EDIT:
This string, with URLs inside <> and text inside "", separated by spaces:
<\janhaeussler.com/?sioc_type=user&sioc_id=1/> "HEY" <.org/1999/02/22-rdf-syntax-ns#type/>

should produce these 3 Strings:
1. <\janhaeussler.com/?sioc_type=user&sioc_id=1/> (with or without <>)
2. "HEY"
3. <.org/1999/02/22-rdf-syntax-ns#type/>

EDIT 2:
I think the symbols <> are confusing. I am trying to find a regex that splits on one or more spaces while ignoring the spaces inside " ". The URLs do not contain spaces, so the <> need no special handling.


Solution

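  • Try this. It splits on whitespace only when an even number of double quotes remains between that whitespace and the end of the string, which means the whitespace is outside any quoted text: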

    \s+(?=(?:(?:[^"]*"){2})*[^"]*$)
    

    Demo

        // Requires: import java.util.Arrays;
        String string = "abc d<\\janhaeussler.com/?sioc_type=user &sioc_id=1/> \"HEY 1\" 2 3 <.org/1999/02/22-rdf-syntax-ns#type/> \"tra la\" <asdfadsf sadfasdf/> 4    \"sdf sdf\" 5 6";
        // Split on whitespace that is followed by an even number of double quotes.
        String[] res = string.split("\\s+(?=(?:(?:[^\"]*\"){2})*[^\"]*$)");
        System.out.println(Arrays.toString(res));
    

    Will output:

    [abc, d<\janhaeussler.com/?sioc_type=user, &sioc_id=1/>, "HEY 1", 2, 3, <.org/1999/02/22-rdf-syntax-ns#type/>, "tra la", <asdfadsf, sadfasdf/>, 4, "sdf sdf", 5, 6]
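If the lookahead still recurses too deeply on very long input lines, one possible alternative is to match the tokens directly instead of splitting around them. The sketch below is not part of the answer above. The class name `QuoteAwareTokenizer`, the method `tokenize`, and the pattern `"[^"]*"|\S+` are hypothetical choices for illustration, and it assumes quotes are balanced and always start a token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteAwareTokenizer {
    // A token is either a double-quoted run (spaces allowed inside)
    // or a run of non-whitespace characters. Hypothetical pattern,
    // not taken from the original answer.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\S+");

    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        // find() skips the whitespace between tokens because neither
        // alternative can start on a whitespace character.
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "<\\janhaeussler.com/?sioc_type=user&sioc_id=1/> \"HEY\" <.org/1999/02/22-rdf-syntax-ns#type/>";
        for (String token : tokenize(line)) {
            System.out.println(token);
        }
    }
}
```

For the example string from the question, this prints the three expected tokens, one per line. It does not behave identically to the split above in every case: a quote that appears in the middle of a token (for example `ab"c d"`) or an unbalanced quote would be tokenized differently, so it is worth checking against real input before relying on it.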