I have tried to run a simple application in a prod environment containing the code from https://github.com/dotnet/spark/blob/master/examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/Basic.cs. The application runs fine and emits output to stdout until it crashes when it hits the first UDF. Thanks for any insights you can share on this.
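For reference, the failing step is the first UDF in that example. A minimal sketch of the pattern, simplified from the linked Basic.cs (the app name, input file, and column names here are illustrative):

    using System;
    using Microsoft.Spark.Sql;
    using static Microsoft.Spark.Sql.Functions;

    class Program
    {
        static void Main()
        {
            // Simplified from the linked Basic.cs; names and paths are illustrative.
            SparkSession spark = SparkSession.Builder().AppName("mySparkApp").GetOrCreate();
            DataFrame df = spark.Read().Json("people.json");
            df.Show(); // this part succeeds: it prints the table shown below

            // The first UDF is where the job dies on the cluster, because the
            // executor-side worker now has to deserialize the UDF's assembly.
            Func<Column, Column> greet = Udf<string, string>(name => "hello " + name);
            df.Select(greet(df["name"])).Show(); // this Show() triggers the showString failure logged below
        }
    }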
Environment: the code is packaged using
dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.16.04-x64
HDInsight cluster HDI 4.0, Spark 2.4 -- the server is set up using the guidelines in https://learn.microsoft.com/en-us/dotnet/spark/tutorials/hdinsight-deployment
spark-submit --master yarn --conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS="./app/publish.zip" --archives wasbs://[email protected]/SparkJobs/publish.zip#mySparkApp --class org.apache.spark.deploy.dotnet.DotnetRunner wasbs://[email protected]/SparkJobs/microsoft-spark-2.4.x-0.12.1.jar wasbs://[email protected]/SparkJobs/publish.zip mySparkApp
(I have tried all sorts of variations on this in desperation: --deploy-mode cluster, various paths, etc. Nothing works.)
stdout: ...
+---+-----+
|age| name|
+---+-----+
| 22|Ricky|
| 36| Jeff|
| 62|Geddy|
+---+-----+
[2020-10-28T09:15:10.1478641Z] [wn0-hdinsi] [Error] [JvmBridge] JVM method execution failed: Nonstatic method 'showString' failed for class '41' when called with 3 arguments ([Index=1, Type=Int32, Value=20], [Index=2, Type=Int32, Value=20], [Index=3, Type=Boolean, Value=False], )
[2020-10-28T09:15:10.1480587Z] [wn0-hdinsi] [Error] [JvmBridge] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 4 times, most recent failure: Lost task 0.3 in stage 16.0 (TID 210, wn0-hdinsi.xwccrqijnmqujdjghwrza0nzbb.fx.internal.cloudapp.net, executor 2): org.apache.spark.api.python.PythonException: System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Spark.Utils.UdfSerDe.<>c.<DeserializeType>b__10_0(TypeData td) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 262
at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
at Microsoft.Spark.Utils.UdfSerDe.DeserializeType(TypeData typeData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 258
at Microsoft.Spark.Utils.UdfSerDe.Deserialize(UdfData udfData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 160
at Microsoft.Spark.Utils.CommandSerDe.DeserializeUdfs[T](UdfWrapperData data, Int32& nodeIndex, Int32& udfIndex) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 333
at Microsoft.Spark.Utils.CommandSerDe.Deserialize[T](Stream stream, SerializedMode& serializerMode, SerializedMode& deserializerMode, String& runMode) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 306
at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 188
at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream, Version version) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 98
at Microsoft.Spark.Worker.Processor.CommandProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 43
at Microsoft.Spark.Worker.Processor.PayloadProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\PayloadProcessor.cs:line 82
at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 143
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
------cut--------

My problem did indeed turn out to be an issue with paths. For anyone else having the same problem: I got this to work by listing the dll that contains the UDF (it can be the same dll as the general Spark app) in --files. So essentially you need the zip file with the assemblies AND direct references to the dlls. There is probably a smarter way, but that did it for me (when running in cluster mode):

spark-submit --deploy-mode cluster --master yarn --files wasbs://[email protected]/SparkJobs/mySparkApp.dll --class org.apache.spark.deploy.dotnet.DotnetRunner wasbs://[email protected]/SparkJobs/microsoft-spark-2.4.x-0.12.1.jar wasbs://[email protected]/SparkJobs/publish.zip mySparkApp
The error occurs because the dll containing your code can't be found.
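To make that failure mode concrete, here is an illustrative sketch (not the actual Microsoft.Spark worker code) of what the worker-side assembly lookup amounts to; the file name and search path are just placeholders:

    using System;
    using System.IO;
    using System.Reflection;

    // Illustration only: the worker probes its assembly search paths for the dll
    // that defines the UDF. If no probe hits, the lookup yields null, and
    // dereferencing that null is the NullReferenceException in the trace above.
    class AssemblyProbe
    {
        static Assembly Resolve(string fileName, string[] searchPaths)
        {
            foreach (string dir in searchPaths)
            {
                string candidate = Path.Combine(dir, fileName);
                if (File.Exists(candidate))
                {
                    return Assembly.LoadFrom(candidate);
                }
            }
            return null; // not found: the caller blows up on the next member access
        }

        static void Main()
        {
            // "mySparkApp.dll" is the assembly name from the question; the search
            // path here stands in for whatever the worker actually probes.
            Assembly asm = Resolve("mySparkApp.dll", new[] { Directory.GetCurrentDirectory() });
            Console.WriteLine(asm == null ? "assembly not found" : asm.FullName);
        }
    }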
Two things. Firstly, in yarn mode a "." at the beginning of DOTNET_ASSEMBLY_SEARCH_PATHS causes the user's home directory to be prepended to the path, so "./app/publish.zip" does not resolve to currentdirectory/app/publish.zip; if the home directory differs from the current directory, the worker will be looking in the wrong place.
Secondly, make sure publish.zip doesn't contain folders and that the dll with the UDF is at the top level of the zip (a quick way to check this is sketched below).
Instead of putting the zip inside the app folder, I would just use the current folder and not worry about DOTNET_ASSEMBLY_SEARCH_PATHS at all.
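To verify the zip layout, here is a minimal check, assuming .NET's System.IO.Compression and that publish.zip is available locally:

    using System;
    using System.IO.Compression;

    // List the entries in publish.zip and flag anything nested in a subfolder,
    // since the dll with the UDF must sit at the top level of the zip.
    class ZipLayoutCheck
    {
        static void Main()
        {
            using (ZipArchive zip = ZipFile.OpenRead("publish.zip"))
            {
                foreach (ZipArchiveEntry entry in zip.Entries)
                {
                    bool nested = entry.FullName.Contains("/") || entry.FullName.Contains("\\");
                    Console.WriteLine(nested
                        ? entry.FullName + "  <-- nested, move to top level"
                        : entry.FullName);
                }
            }
        }
    }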
For a walkthrough, make sure you follow:
https://learn.microsoft.com/en-us/dotnet/spark/tutorials/hdinsight-deployment