Tags: apache-spark-sql, azure-synapse, azure-synapse-analytics

How to troubleshoot a Synapse Spark Notebook that runs manually, but times out if run from a Pipeline


  • The Spark Notebook, if run manually (whether under my credentials or as the Managed Identity), runs successfully in about 3 minutes.

  • If placed in a Pipeline, it ran fine prior to Tuesday.

  • Since Tuesday, it fails when in a Pipeline.

  • It uses the same Spark Pool as the successful manual runs.

  • There were no updates to the notebook's code prior to the sudden failure.

  • The committed version of the notebook is the same as the one that runs successfully.

It seems to time out after 2 hours on the last step (INSERT).

Error Message:

{
"errorCode": "6002",
"message": "Error: Job aborted.\norg.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:651)\norg.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:284)\norg.apache.spark.sql.delta.files.TransactionalWrite.$anonfun$writeFiles$1(TransactionalWrite.scala:456)\norg.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)\norg.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:183)\norg.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:97)\norg.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)\norg.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)\norg.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:391)\norg.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:355)\norg.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:103)\norg.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:221)\norg.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:218)\norg.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:103)\norg.apache.spark.sql.delta.commands.WriteIntoDelta.write(WriteIntoDelta.scala:337)\norg.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1(WriteIntoDelta.scala:98)\norg.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1$adapted(WriteIntoDelta.scala:91)\norg.apache.spark.sql.delta.DeltaLog.withNewTransaction(DeltaLog.scala:252)\norg.apache.spark.sql.delta.commands.WriteIntoDelta.run(WriteIntoDelta.scala:91)\norg.apache.spark.sql.delta.catalog.WriteIntoDeltaBuilder$$anon$1$$anon$2.insert(DeltaTableV2.scala:279)\norg.apache.spark.sql.execution.datasources.v2.SupportsV1Write.writeWithV1(V1FallbackWriters.scala:79)\norg.apache.spark.sql.execution.datasources.v2.SupportsV1Write.writeWithV1$(V1FallbackWriters.scala:78)\norg.apache.spark.sql.execution.datasources.v2.AppendDataExecV1.writeWithV1(V1FallbackWriters.scala:34)\norg.apache.spark.sql.execution.datasources.v2.V1FallbackWriters.run(V1FallbackWriters.scala:66)\norg.apache.spark.sql.execution.datasources.v2.V1FallbackWriters.run$(V1FallbackWriters.scala:65)\norg.apache.spark.sql.execution.datasources.v2.AppendDataExecV1.run(V1FallbackWriters.scala:34)\norg.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)\norg.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)\norg.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)\norg.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:113)\norg.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)\norg.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:183)\norg.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:97)\norg.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)\norg.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)\norg.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:113)\norg.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteComma
nds$1.applyOrElse(QueryExecution.scala:106)\norg.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)\norg.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)\norg.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)\norg.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)\norg.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)\norg.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)\norg.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)\norg.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)\norg.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)\norg.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)\norg.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:90)\norg.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:84)\norg.apache.spark.sql.Dataset.<init>(Dataset.scala:231)\norg.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:101)\norg.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)\norg.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)\norg.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:622)\norg.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)\norg.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)\norg.apache.livy.repl.SQLInterpreter.execute(SQLInterpreter.scala:129)\norg.apache.livy.repl.Session.$anonfun$executeCode$1(Session.scala:716)\nscala.Option.map(Option.scala:230)\norg.apache.livy.repl.Session.executeCode(Session.scala:713)\norg.apache.livy.repl.Session.$anonfun$execute$4(Session.scala:492)\norg.apache.livy.repl.Session.withRealtimeOutputSupport(Session.scala:936)\norg.apache.livy.repl.Session.$anonfun$execute$1(Session.scala:492)\nscala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\nscala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)\nscala.util.Success.$anonfun$map$1(Try.scala:255)\nscala.util.Success.map(Try.scala:213)\nscala.concurrent.Future.$anonfun$map$1(Future.scala:292)\nscala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)\nscala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)\nscala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)\njava.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\njava.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\njava.lang.Thread.run(Thread.java:750)",
"failureType": "UserError",
"target": "Rebuild_factGL",
"details": []

I've tried:

  1. Assigning the pipeline a larger Spark Pool
  2. Placing the notebook in a brand new pipeline by itself
  3. Recreating the notebook, and running that from a pipeline

How can I troubleshoot better?


Solution

  • Solved-ish.

    TLDR: Use PARSER_VERSION = '2.0' when doing OPENROWSET on CSV files.

    Approximately one week after the Spark Notebook stopped working when run in a Pipeline, it also stopped working when run manually. I'm not sure why there was a delayed reaction, but at least the failure was now consistent.

    I isolated it back to a SQL Serverless View the Spark Notebook was calling.

    That view had suddenly gone from running in 1.5 minutes, to timing out.

    In particular, the View would time out only when all 3 conditions were met: selecting both numeric columns and including a WHERE clause on the date. With only 2 of the conditions met, it ran in 1.5 minutes; all 3 produced a timeout.

    This was reproducible behavior for 3 weeks.

    I never was able to troubleshoot what exactly was occurring. But all the underlying tables of the View were ultimately coming from CSV files in the data lake. I changed all of those OPENROWSET calls to PARSER_VERSION = '2.0', and now it runs, faster than it did prior to the hiccup. A sketch of the change is shown below.
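
    Here is a minimal sketch of the kind of change involved, assuming a serverless SQL view built over CSV files with OPENROWSET. The view name, storage path, and columns are placeholders, not the actual objects from my workspace:

        -- Hypothetical serverless SQL view over CSV files in the data lake.
        -- View name, path, and columns are illustrative only.
        CREATE VIEW dbo.vw_GLTransactions
        AS
        SELECT
            rows.TransactionDate,
            rows.DebitAmount,
            rows.CreditAmount
        FROM OPENROWSET(
                BULK 'https://mydatalake.dfs.core.windows.net/raw/gl/*.csv',
                FORMAT = 'CSV',
                PARSER_VERSION = '2.0',  -- the fix: use the version 2.0 CSV parser
                HEADER_ROW = TRUE        -- supported only with PARSER_VERSION = '2.0'
            )
            WITH (
                TransactionDate DATE,
                DebitAmount     DECIMAL(18, 2),
                CreditAmount    DECIMAL(18, 2)
            ) AS rows;

    The only change relative to the original views was adding PARSER_VERSION = '2.0' (and switching FIRSTROW to HEADER_ROW where applicable); everything else stayed the same.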