Tags: apache-spark, spark-operator, google-spark-operator

spark-submit fails when submitting multiple spark applications at once using spark-on-k8s-operator


I'm trying to submit around 20 Spark applications at once, and most of them fail. How do I prevent this? The spark-operator pods are not running out of memory. CPU usage does spike, but only briefly, and the spark-operator pod does not restart because of these jobs.

Logs:

10 controller.go:184] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was added, enqueuing it for submission
10 controller.go:184] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was added, enqueuing it for submission
10 controller.go:184] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was added, enqueuing it for submission
10 controller.go:184] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was added, enqueuing it for submission
10 controller.go:263] Starting processing key: "spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1"
10 sparkui.go:282] Creating a service sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1-ui-svc for the Spark UI for application sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1
10 event.go:282] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"spark", Name:"sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1", UID:"3867b989-71e6-4e47-88e9-e9d88618e269", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"380961510", FieldPath:""}): type: 'Normal' reason: 'SparkApplicationAdded' SparkApplication sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was added, enqueuing it for submission
10 controller.go:184] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was added, enqueuing it for submission
10 sparkui.go:148] Creating an Ingress sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1-ui-ingress for the Spark UI for application sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1
10 submission.go:65] spark-submit arguments: [/opt/spark/bin/spark-submit --class xyz --master ... ]
10 controller.go:728] failed to run spark-submit for SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1: failed to run spark-submit for SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1: WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/08/30 19:41:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/08/30 19:41:08 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
22/08/30 19:41:36 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
I0830 19:42:00.711350      10 controller.go:822] Update the status of SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 from:
{
  "lastSubmissionAttemptTime": null,
  "terminationTime": null,
  "driverInfo": {},
  "applicationState": {
    "state": ""
  }
}
to:
{
  "lastSubmissionAttemptTime": "2022-08-30T19:42:00Z",
  "terminationTime": null,
  "driverInfo": {},
  "applicationState": {
    "state": "SUBMISSION_FAILED",
    "errorMessage": "failed to run spark-submit for SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1: WARNING: An illegal reflective access operation has occurred\nWARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\nWARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\nWARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\nWARNING: All illegal access operations will be denied in a future release\n22/08/30 19:41:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\nUsing Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n22/08/30 19:41:08 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file\n22/08/30 19:41:36 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.\n"
  },
  "submissionAttempts": 1
}
I0830 19:42:00.712173      10 event.go:282] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"spark", Name:"sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1", UID:"3867b989-71e6-4e47-88e9-e9d88618e269", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"380961510", FieldPath:""}): type: 'Warning' reason: 'SparkApplicationSubmissionFailed' failed to submit SparkApplication sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1: failed to run spark-submit for SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1: WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/08/30 19:41:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/08/30 19:41:08 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
22/08/30 19:41:36 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
I0830 19:42:00.723920      10 controller.go:223] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was updated, enqueuing it
I0830 19:42:00.724098      10 controller.go:223] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was updated, enqueuing it
I0830 19:42:00.724154      10 controller.go:223] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was updated, enqueuing it
I0830 19:42:00.724353      10 controller.go:223] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was updated, enqueuing it
I0830 19:42:00.811873      10 controller.go:223] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was updated, enqueuing it
I0830 19:42:00.812538      10 controller.go:270] Ending processing key: "spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1"
I0830 19:42:00.812567      10 controller.go:263] Starting processing key: "spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1"
I0830 19:42:00.812839      10 controller.go:822] Update the status of SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 from:
{
  "lastSubmissionAttemptTime": "2022-08-30T19:42:00Z",
  "terminationTime": null,
  "driverInfo": {},
  "applicationState": {
    "state": "SUBMISSION_FAILED",
    "errorMessage": "failed to run spark-submit for SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1: WARNING: An illegal reflective access operation has occurred\nWARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\nWARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\nWARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\nWARNING: All illegal access operations will be denied in a future release\n22/08/30 19:41:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\nUsing Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n22/08/30 19:41:08 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file\n22/08/30 19:41:36 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.\n"
  },
  "submissionAttempts": 1
}
to:
{
  "lastSubmissionAttemptTime": "2022-08-30T19:42:00Z",
  "terminationTime": null,
  "driverInfo": {},
  "applicationState": {
    "state": "FAILED",
    "errorMessage": "failed to run spark-submit for SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1: WARNING: An illegal reflective access operation has occurred\nWARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\nWARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\nWARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\nWARNING: All illegal access operations will be denied in a future release\n22/08/30 19:41:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\nUsing Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n22/08/30 19:41:08 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file\n22/08/30 19:41:36 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.\n"
  },
  "submissionAttempts": 1
}
I0830 19:42:00.813582      10 event.go:282] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"spark", Name:"sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1", UID:"3867b989-71e6-4e47-88e9-e9d88618e269", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"380963223", FieldPath:""}): type: 'Warning' reason: 'SparkApplicationFailed' SparkApplication sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 failed: failed to run spark-submit for SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1: WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/08/30 19:41:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/08/30 19:41:08 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
22/08/30 19:41:36 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
I0830 19:42:00.824101      10 controller.go:223] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was updated, enqueuing it
I0830 19:42:00.824213      10 controller.go:223] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was updated, enqueuing it
I0830 19:42:00.824904      10 controller.go:223] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was updated, enqueuing it
I0830 19:42:00.824802      10 controller.go:223] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was updated, enqueuing it
I0830 19:42:01.011831      10 controller.go:270] Ending processing key: "spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1"
I0830 19:42:01.011938      10 controller.go:223] SparkApplication spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1 was updated, enqueuing it
I0830 19:42:01.011995      10 controller.go:263] Starting processing key: "spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1"
I0830 19:42:01.012207      10 controller.go:270] Ending processing key: "spark/sch-3a44a9db-7993-413e-2022-08-29t18-30-00tz00-00-1"

Solution

  • The issue was that the spark-operator pod did not have enough CPU/memory. For each submission, a JVM is spawned inside the spark-operator pod; if the pod does not have enough resources, Kubernetes kills these JVMs, resulting in failed spark-submits.

    I fixed this by removing the CPU and memory limits in the Helm chart.

    The chart's values file mentions the issue:

    # Note, that each job submission will spawn a JVM within the Spark Operator Pod using "/usr/local/openjdk-11/bin/java -Xmx128m".
    # Kubernetes may kill these Java processes at will to enforce resource limits. When that happens, you will see the following error:
    # 'failed to run spark-submit for SparkApplication [...]: signal: killed' - when this happens, you may want to increase memory limits.
    resources: {}
      # limits:
      #   cpu: 100m
      #   memory: 300Mi
      # requests:
      #   cpu: 100m
      #   memory: 300Mi
    

    Even though the comment says each submission spawns a JVM with -Xmx128m, the actual memory used for about 20 concurrent applications was only around 400 MB, and CPU usage peaked at about 1.5 cores.
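
    If you prefer to keep the pod bounded rather than removing the limits entirely, you can raise them via a Helm values override instead. A minimal sketch, sized from the observed peaks above (the exact figures are illustrative assumptions, not the chart's defaults):

    ```yaml
    # values.yaml override for the spark-operator Helm chart.
    # Raising (rather than removing) the limits leaves headroom for the
    # short-lived spark-submit JVMs that each submission spawns in the pod.
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: "2"        # CPU peaked at ~1.5 cores for ~20 concurrent submissions
        memory: 1Gi     # memory peaked at ~400 MB, so 1Gi leaves headroom
    ```

    This would be applied with something like `helm upgrade <release> <chart> -f values.yaml`, where the release and chart names depend on how the operator was installed.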