I'm working on a MapReduce project on Amazon Web Services and I'm getting this error:
FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.StackOverflowError at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
I read a few other questions to understand why this happened and it seems my regex has repetitive alternative paths. This is the regex:
\\s+(?=(?:(?<=[a-zA-Z])\"(?=[A-Za-z])|\"[^\"]*\"|[^\"])*$)
It splits on spaces, except for spaces that are inside < > or inside " ". So basically it extracts the strings that are enclosed in those two kinds of delimiters. I have tried many other versions but none of them work, so I am far from an optimal one. I am kind of lost, and this is the first time I'm using regexes this complicated. Can someone please suggest a better regex?
I would truly appreciate any feedback on this!
EDIT:
This string, with URLs inside <>, text inside "", and spaces between them:
<\janhaeussler.com/?sioc_type=user&sioc_id=1/> "HEY" <.org/1999/02/22-rdf-syntax-ns#type/>
should produce these 3 Strings:
1. <\janhaeussler.com/?sioc_type=user&sioc_id=1/> (with or without <>)
2. "HEY"
3. <.org/1999/02/22-rdf-syntax-ns#type/>
EDIT 2:
I think the <> symbols are confusing. I am trying to find a regex that splits on one or more spaces but ignores the spaces inside " ". The URLs do not contain spaces, so the < > need no special handling.
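Try this. It splits on whitespace only when an even number of double quotes follows it up to the end of the string, which means the whitespace is outside any quoted text: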
\s+(?=(?:(?:[^"]*"){2})*[^"]*$)
import java.util.Arrays; // needed for Arrays.toString

String string = "abc d<\\janhaeussler.com/?sioc_type=user &sioc_id=1/> \"HEY 1\" 2 3 <.org/1999/02/22-rdf-syntax-ns#type/> \"tra la\" <asdfadsf sadfasdf/> 4 \"sdf sdf\" 5 6";
// Split on whitespace that is followed by an even number of double quotes
String[] res = string.split("\\s+(?=(?:(?:[^\"]*\"){2})*[^\"]*$)");
System.out.println(Arrays.toString(res));
This outputs:
[abc, d<\janhaeussler.com/?sioc_type=user, &sioc_id=1/>, "HEY 1", 2, 3, <.org/1999/02/22-rdf-syntax-ns#type/>, "tra la", <asdfadsf, sadfasdf/>, 4, "sdf sdf", 5, 6]
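If the lookahead version still turns out to be too heavy on very long lines, another option is to match the tokens themselves instead of splitting. The following is only a sketch under a few assumptions: the class name QuoteAwareTokenizer and the method tokenize are made up for illustration, quoted strings are assumed to start at a token boundary, and the quotes are assumed to be balanced. The pattern has no lookahead and no alternation inside a repeated group, so it should not need the deep recursion that caused the StackOverflowError, although I have not tested it on your MapReduce input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class QuoteAwareTokenizer {
    // A token is either a complete "quoted string" (spaces allowed inside)
    // or a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\S+");

    // Collects every token found in the line, in order of appearance.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "<\\janhaeussler.com/?sioc_type=user&sioc_id=1/> \"HEY 1\" <.org/1999/02/22-rdf-syntax-ns#type/>";
        // Prints the three tokens: the first URL, "HEY 1" (quotes kept), and the second URL.
        System.out.println(tokenize(line));
    }
}
```

Note that this behaves differently from the split when a quote appears in the middle of a token (for example abc"d e"), because the quoted alternative is only tried where a token starts.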