
Scala via Spark with yarn - curly brackets string missing


I wrote some Scala code that looks like this:

object myScalaApp {
    def main(args: Array[String]): Unit = {
        // The JSON string is passed as the first application argument.
        val strJson = args(0)
        println("strJson : " + strJson)
    }
}

I launch this Scala jar on YARN like this:

Process spark = new SparkLauncher()
    .setAppResource("/usr/local/myJar/myApp.jar")
    .setMainClass("com.myScalaApp")
    .setMaster("yarn")
    .setDeployMode("cluster")
    .addAppArgs(data)
    .launch();

When I pass a JSON string like this:

{\"aaa\" : \"a1111\",\"bbbb\" : \"b1111\"}

it prints the following, as I expect:

strJson : {"aaa" : "a1111","bbbb" : "b1111"}

But when I pass a JSON string with a nested object, like this:

{\"aaa\" : \"a1111\",\"bbbb\" : \"b1111\",\"ccc\" : {\"c1\" : \"c111\"}}

it prints this:

strJson : {"aaa" : "a1111","bbbb" : "b1111","ccc" : {"c1" : "c111"

Why do all the closing curly brackets disappear?


Extra samples:

1

\"{\"aaa\" : \"a1111\",\"bbbb\" : \"b1111\",\"ccc\" : {\"c1\" : \"c111\"}}\"

strJson : "{"aaa" : "a1111","bbbb" : "b1111","ccc" : {"c1" : "c111""

2

{\"aaa\" : \"a1111\",\"bbbb\" : \"b1111\",\"ccc\" : {\"c1\" : \"c111\"}a}

strJson : {"aaa" : "a1111","bbbb" : "b1111","ccc" : {"c1" : "c111"}a}


Solution

  • This happens because YARN treats {{ and }} in the launch command as parameter expansion markers and rewrites them into references to environment variables.

    For example, if you pass run_job.sh {{MY_VARIABLE}} to YARN, it converts it to run_job.sh $MY_VARIABLE so that the environment variable is used. On Linux, {{ becomes $ and }} is simply deleted, which is why the closing braces vanish.

    So the issue appears whenever the command line contains two adjacent curly braces, as JSON with nested objects does. That is also why sample 2 in the question, which ends in }a}, arrives intact. It only happens with YARN as the master in cluster deploy mode; Spark standalone and YARN client mode are not affected.

    To fix it, either use a data format other than JSON or make sure that no two curly braces are adjacent.

    For example, in Python you could quickly work around the issue like this:

    import re

    def fix_json_for_yarn(json_string):
        # See https://issues.apache.org/jira/browse/SPARK-17814
        # Due to that YARN bug we need to make sure that our json string
        # doesn't contain {{ or }} because those get replaced by YARN.
        # Lookaheads also split runs of three or more braces such as "}}}".
        json_string = re.sub(r"\}(?=\})", "} ", json_string)
        return re.sub(r"\{(?=\{)", "{ ", json_string)
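
The other option mentioned above, using a data format other than JSON on the command line, could be sketched as follows. This is only an illustration, not something from the ticket: encode_arg_for_yarn and decode_arg_from_yarn are hypothetical helper names, and the idea is simply to Base64-encode the JSON so the argument contains no braces at all.

```python
import base64
import json


def encode_arg_for_yarn(payload):
    """Serialize payload to JSON, then Base64-encode it (hypothetical helper)."""
    raw = json.dumps(payload).encode("utf-8")
    # The URL-safe Base64 alphabet is A-Z, a-z, 0-9, '-', '_' plus '=' padding,
    # so the result cannot contain "{{" or "}}".
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_arg_from_yarn(arg):
    """Reverse encode_arg_for_yarn on the receiving side (hypothetical helper)."""
    return json.loads(base64.urlsafe_b64decode(arg).decode("utf-8"))


payload = {"aaa": "a1111", "bbbb": "b1111", "ccc": {"c1": "c111"}}
encoded = encode_arg_for_yarn(payload)
```

With this approach the Spark application has to decode the argument before parsing it; on the Scala side, java.util.Base64.getUrlDecoder should be able to do that.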
    

    You can see the problematic YARN code here:

      @VisibleForTesting
      public static String expandEnvironment(String var,
          Path containerLogDir) {
        var = var.replace(ApplicationConstants.LOG_DIR_EXPANSION_VAR,
          containerLogDir.toString());
        var =  var.replace(ApplicationConstants.CLASS_PATH_SEPARATOR,
          File.pathSeparator);
    
        // replace parameter expansion marker. e.g. {{VAR}} on Windows is replaced
        // as %VAR% and on Linux replaced as "$VAR"
        if (Shell.WINDOWS) {
          var = var.replaceAll("(\\{\\{)|(\\}\\})", "%");
        } else {
          var = var.replace(ApplicationConstants.PARAMETER_EXPANSION_LEFT, "$");
          var = var.replace(ApplicationConstants.PARAMETER_EXPANSION_RIGHT, "");
        }
        return var;
      }
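
To see the effect concretely, the non-Windows branch above can be mimicked in a few lines of Python. This is only a sketch: expand_environment_linux is a hypothetical stand-in, and the literal {{ and }} values for the two constants are assumed from the description of the markers earlier in this answer.

```python
def expand_environment_linux(command):
    """Mimic only the two non-Windows replace() calls shown above.

    Hypothetical stand-in: the real method also expands the log directory
    and classpath separator markers, which are omitted here.
    """
    command = command.replace("{{", "$")  # assumed PARAMETER_EXPANSION_LEFT
    command = command.replace("}}", "")   # assumed PARAMETER_EXPANSION_RIGHT
    return command


nested = '{"aaa" : "a1111","ccc" : {"c1" : "c111"}}'
spaced = '{"aaa" : "a1111","ccc" : {"c1" : "c111"} }'

mangled = expand_environment_linux(nested)    # trailing "}}" is removed
intact = expand_environment_linux(spaced)     # "} }" is left alone
template = expand_environment_linux("run_job.sh {{MY_VARIABLE}}")
```

The nested string loses both closing braces, matching the output in the question, while the variant with a space between the braces passes through unchanged.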
    

    See the ticket for this problem: https://issues.apache.org/jira/browse/SPARK-17814