I'm trying to split each element of a JavaRDD on whitespace, except for the parts inside double quotes and square brackets. I'm using the following code:
    SparkConf conf = new SparkConf().setAppName("LogAnalyzer");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String[]> logRdd = sc.textFile(logPath).map(new Function<String, String[]>() {
        public String[] call(String s) {
            return s.split("\\s+(?![^\\\\[]*\\\\])(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");
        }
    });

    for (String[] arr : logRdd.take(10)) {
        for (String s : arr) {
            System.out.print("| " + s + " |");
        }
        System.out.println("-------------------");
    }
    sc.close();
But I get this error at runtime:
18/06/05 01:07:02 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.util.regex.PatternSyntaxException: Unclosed character class near index 49
\\s+(?![^\\[]*\\])(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
^
I don't get this error when I use the same split in a plain Java program:
String[] splitted = s.split("\\s+(?![^\\[]*\\])(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");
Do I need to do something else in Spark? Please let me know if any more information is required.
You have two extra \\ escapes in

    public String[] call(String s) {
        return s.split("\\s+(?![^\\\\[]*\\\\])(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");
    }

which should be

    public String[] call(String s) {
        return s.split("\\s+(?![^\\[]*\\])(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");
    }

In the over-escaped version, the string literal \\\\[ compiles to the regex \\[, that is, an escaped backslash followed by a bare [ inside the character class. In a Java regex a bare [ inside a character class opens a nested class, so the class is never closed, which is exactly the "Unclosed character class" PatternSyntaxException you are seeing. Likewise, \\\\] compiles to \\] (a backslash match plus a stray ]) instead of the escaped bracket \] you intended.
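As a quick sanity check outside Spark, here is a minimal, self-contained sketch of the corrected pattern applied to a made-up Apache-style log line (your real input format may differ):

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        // Hypothetical sample log line for illustration only.
        String line = "127.0.0.1 - - [05/Jun/2018:01:07:02 +0000] "
                + "\"GET /index.html HTTP/1.1\" 200 1234";

        // Split on whitespace only when the position is not inside [...]
        // (negative lookahead) and is followed by an even number of quotes,
        // i.e. the position is not inside a quoted section (positive lookahead).
        String[] parts = line.split(
                "\\s+(?![^\\[]*\\])(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");

        System.out.println(Arrays.toString(parts));
        // The bracketed timestamp and the quoted request each stay in one piece:
        // [05/Jun/2018:01:07:02 +0000] and "GET /index.html HTTP/1.1"
    }
}
```

With this escaping the pattern compiles cleanly and the sample line splits into 7 fields, keeping the timestamp and the request line intact.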