Tags: java, apache-spark, rdd

String Split Error


I'm trying to split each element of a JavaRDD on whitespace, except for the parts inside quotes and inside []. I'm using the following code for this:

    SparkConf conf = new SparkConf().setAppName("LogAnalyzer");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String[]> logRdd = sc.textFile(logPath).map(new Function<String, String[]>() {
        public String[] call(String s) {
            return s.split("\\s+(?![^\\\\[]*\\\\])(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");
        }
    });
    for (String[] arr : logRdd.take(10)) {
        for (String s : arr) {
            System.out.print("| " + s + " |");
        }
        System.out.println("-------------------");
    }
    sc.close();

But I get this error at runtime:

18/06/05 01:07:02 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.util.regex.PatternSyntaxException: Unclosed character class near index 49
\\s+(?![^\\[]*\\])(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
                                             ^

I don't get this error when I use the same split in a plain Java file:

String[] splitted = s.split("\\s+(?![^\\[]*\\])(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");

Do I need to do something else in Spark? Please let me know if any more information is required.


Solution

  • You have two extra `\\` in

        public String[] call(String s) { return s.split("\\s+(?![^\\\\[]*\\\\])(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)"); }

    which should be

        public String[] call(String s) { return s.split("\\s+(?![^\\[]*\\])(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)"); }

    With the extra escaping, the pattern the regex engine actually receives contains `[^\\[]`: the `\\` matches a literal backslash, and the now-unescaped `[` starts a nested character class (Java regex supports character-class unions), which swallows the `]` that was meant to close the class. The class is therefore never closed, and compilation fails with `Unclosed character class` at the end of the pattern. Your plain Java test works because there you wrote the correctly escaped string literal.
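To sanity-check the corrected pattern outside of Spark, here is a small standalone sketch. The Apache-style access log line is made up for illustration; it shows that whitespace inside the bracketed timestamp and the quoted request string is not split on:

```java
public class SplitDemo {
    public static void main(String[] args) {
        // Hypothetical Apache-style access log line, for illustration only
        String s = "127.0.0.1 - - [05/Jun/2018:01:07:02 +0000] \"GET /index.html HTTP/1.1\" 200";

        // Corrected pattern: split on whitespace that is neither inside [...] nor inside "..."
        String[] parts = s.split("\\s+(?![^\\[]*\\])(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)");

        for (String part : parts) {
            System.out.println(part);
        }
    }
}
```

This should print six tokens, with `[05/Jun/2018:01:07:02 +0000]` and `"GET /index.html HTTP/1.1"` each coming out as a single token; the split only consumes the whitespace between the outer fields.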