Tags: apache-spark, pyspark, spark-streaming, spark-structured-streaming, spark-submit

cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDD


I am trying to submit a PySpark job to a Kubernetes (k8s) Spark cluster using Airflow. In that job I use writeStream with a foreachBatch function to write the streaming data, and regardless of the sink type I hit this issue only when I try to write the data:

Inside the Spark cluster: Spark 3.3.0, PySpark 3.3, Scala 2.12.15, OpenJDK 64-Bit Server VM 11.0.15

Inside Airflow: Spark 3.1.2, PySpark 3.1.2, Scala 2.12.10, OpenJDK 64-Bit Server VM 1.8.0

Dependencies: org.scala-lang:scala-library:2.12.8, org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0, org.apache.spark:spark-sql_2.12:3.3.0, org.apache.spark:spark-core_2.12:3.3.0, org.postgresql:postgresql:42.3.3
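For reference, the runtime versions can also be confirmed from inside a submitted job; a minimal sketch (note that the _jvm gateway access is a PySpark internal):

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version-check").getOrCreate()

print("PySpark:", pyspark.__version__)      # client-side library version
print("Spark:", spark.version)              # version of the Spark runtime actually executing the job
print("Scala:", spark.sparkContext._jvm.scala.util.Properties.versionString())   # via py4j (internal handle)
print("Java:", spark.sparkContext._jvm.System.getProperty("java.version"))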

The DAG I am using to submit it is:

import airflow
from datetime import timedelta
from airflow import DAG
from time import sleep
from datetime import datetime
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

dag = DAG(dag_id='testpostgres.py', schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)

spark_job = SparkSubmitOperator(application= '/usr/local/airflow/data/testpostgres.py',
                            conn_id= 'spark_kcluster',
                            task_id= 'spark_job_test',
                            dag= dag,
                            packages= "org.scala-lang:scala-library:2.12.8,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,org.apache.spark:spark-sql_2.12:3.3.0,org.apache.spark:spark-core_2.12:3.3.0,org.postgresql:postgresql:42.3.3",
                            conf ={
                                   'deploy-mode' : 'cluster',
                                   'executor_cores' : 1,
                                   'EXECUTORS_MEM' : '2G',
                                   'name' : 'spark-py',
                                   'spark.kubernetes.namespace' : 'sandbox',
                                   'spark.kubernetes.file.upload.path' : '/usr/local/airflow/data',
                                   'spark.kubernetes.container.image' : '**********',
                                   'spark.kubernetes.container.image.pullPolicy' : 'IfNotPresent',
                                   'spark.kubernetes.authenticate.driver.serviceAccountName' : 'spark',
                                   'spark.kubernetes.driver.volumes.persistentVolumeClaim.rwopvc.options.claimName' : 'data-pvc',
                                   'spark.kubernetes.driver.volumes.persistentVolumeClaim.rwopvc.mount.path' : '/usr/local/airflow/data',
                                   'spark.driver.extraJavaOptions' : '-Divy.cache.dir=/tmp -Divy.home=/tmp'
                                  }

)
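Note: 'deploy-mode', 'executor_cores', and 'EXECUTORS_MEM' are not spark.* configuration keys, so they are most likely ignored when passed through conf. Below is a hedged sketch of the same call using the operator's own arguments instead (deploy mode is normally taken from the Airflow Spark connection's "deploy-mode" extra); this is only a variant for illustration, not the fix reported in the solution:

spark_job = SparkSubmitOperator(
    application='/usr/local/airflow/data/testpostgres.py',
    conn_id='spark_kcluster',      # deploy mode (client/cluster) usually comes from this connection's extra
    task_id='spark_job_test',
    name='spark-py',
    executor_cores=1,              # instead of the 'executor_cores' conf key
    executor_memory='2G',          # instead of the 'EXECUTORS_MEM' conf key
    packages='org.scala-lang:scala-library:2.12.8,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,org.apache.spark:spark-sql_2.12:3.3.0,org.apache.spark:spark-core_2.12:3.3.0,org.postgresql:postgresql:42.3.3',
    conf={
        'spark.kubernetes.namespace': 'sandbox',
        'spark.kubernetes.container.image': '**********',
        'spark.kubernetes.authenticate.driver.serviceAccountName': 'spark',
        'spark.driver.extraJavaOptions': '-Divy.cache.dir=/tmp -Divy.home=/tmp',
    },
    dag=dag,
)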

This is the job I am trying to submit:

from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import dayofweek
from pyspark.sql.functions import date_format
from pyspark.sql.functions import hour
from functools import reduce
from pyspark.sql.types import DoubleType, StringType, ArrayType
import pandas as pd
import json

spark = SparkSession.builder.appName('spark').getOrCreate()


kafka_topic_name = '****'
kafka_bootstrap_servers = '*********' + ':' + '*****'

streaming_dataframe = spark.readStream.format("kafka").option("kafka.bootstrap.servers", kafka_bootstrap_servers).option("subscribe", kafka_topic_name).option("startingOffsets", "earliest").load()
streaming_dataframe = streaming_dataframe.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

dataframe_schema = '******'
streaming_dataframe = streaming_dataframe.select(from_csv(col("value"), dataframe_schema).alias("pipeline")).select("pipeline.*")

tumblingWindows = streaming_dataframe.withWatermark("timeStamp", "48 hour").groupBy(window("timeStamp", "24 hour", "1 hour"), "phoneNumber").agg((F.first(F.col("duration")).alias("firstDuration")))

tumblingWindows = tumblingWindows.withColumn("start_window", F.col('window')['start'])
tumblingWindows = tumblingWindows.withColumn("end_window", F.col('window')['end'])
tumblingWindows = tumblingWindows.drop('window')

def postgres_write(tumblingWindows, epoch_id):
    tumblingWindows.write.jdbc(url=db_target_url, table=table_postgres, mode='append', properties=db_target_properties)
    pass

db_target_url = 'jdbc:postgresql://' + '*******'+ ':' + '****' + '/' + 'test'

table_postgres = '******'

db_target_properties = {
     'user': 'postgres',
     'password': 'postgres',
     'driver': 'org.postgresql.Driver'
}
query = tumblingWindows.writeStream.foreachBatch(postgres_write).start().awaitTermination()

Error logs:

Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
      at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
      at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
      at scala.Option.foreach(Option.scala:407)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
      at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:377)
      ... 42 more
Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDDPartition.inputPartitions of type scala.collection.Seq in instance of org.apache.spark.sql.execution.datasources.v2.DataSourceRDDPartition
      at java.base/java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(Unknown Source)
      at java.base/java.io.ObjectStreamClass$FieldReflector.checkObjectFieldValueTypes(Unknown Source)
      at java.base/java.io.ObjectStreamClass.checkObjFieldValueTypes(Unknown Source)
      at java.base/java.io.ObjectInputStream.defaultCheckFieldValues(Unknown Source)
      at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
      at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
      at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
      at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
      at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
      at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
      at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
      at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
      at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
      at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
      at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:507)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      at java.base/java.lang.Thread.run(Unknown Source)
Traceback (most recent call last):
File "/usr/local/airflow/data/spark-upload-d03175bc-8c50-4baf-8383-a203182f16c0/debug.py", line 20, in <module>
  streaming_dataframe.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")\
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 107, in awaitTermination
File "/opt/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 196, in deco
pyspark.sql.utils.StreamingQueryException: Query [id = d0e140c1-830d-49c8-88b7-90b82d301408, runId = c0f38f58-6571-4fda-b3e0-98e4ffaf8c7a] terminated with exception: Writing job aborted
22/08/24 10:12:53 INFO SparkUI: Stopped Spark web UI at ************************
22/08/24 10:12:53 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
22/08/24 10:12:53 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
22/08/24 10:12:53 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
22/08/24 10:12:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/08/24 10:12:53 INFO MemoryStore: MemoryStore cleared
22/08/24 10:12:53 INFO BlockManager: BlockManager stopped
22/08/24 10:12:53 INFO BlockManagerMaster: BlockManagerMaster stopped
22/08/24 10:12:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/08/24 10:12:54 INFO SparkContext: Successfully stopped SparkContext
22/08/24 10:12:54 INFO ShutdownHookManager: Shutdown hook called
22/08/24 10:12:54 INFO ShutdownHookManager: Deleting directory /var/data/spark-32ef85e0-e85c-4ac6-a46d-d3379ca58468/spark-adecf44a-dc60-4a85-bbe3-bc125f5cc39f/pyspark-f3ffaa5e-a490-464a-98d2-fbce223628eb
22/08/24 10:12:54 INFO ShutdownHookManager: Deleting directory /var/data/spark-32ef85e0-e85c-4ac6-a46d-d3379ca58468/spark-adecf44a-dc60-4a85-bbe3-bc125f5cc39f
22/08/24 10:12:54 INFO ShutdownHookManager: Deleting directory /tmp/spark-5acdd5e6-7f6e-45ec-adae-e98862e1537c




Solution

  • I faced this issue recently. I think it occurs when shuffling data coming from Kafka. I fixed it by adding all of the dependencies (JARs) of org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 to the project. You can find them here. For now, I don't know which of them are strictly necessary. A sketch of passing these through the operator's packages argument is shown below.
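For reference, here is a hedged sketch of what that could look like with the packages argument of the SparkSubmitOperator from the question. The transitive coordinates below (token provider, kafka-clients, commons-pool2) are my reading of the connector's 3.3.0 POM; verify the exact set and versions on Maven Central:

# Sketch only: list the Kafka connector together with what appear to be its transitive artifacts.
kafka_deps = ",".join([
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0",
    "org.apache.spark:spark-token-provider-kafka-0-10_2.12:3.3.0",
    "org.apache.kafka:kafka-clients:2.8.1",        # version assumed, check the 3.3.0 POM
    "org.apache.commons:commons-pool2:2.11.1",     # version assumed, check the 3.3.0 POM
    "org.postgresql:postgresql:42.3.3",
])

spark_job = SparkSubmitOperator(
    application='/usr/local/airflow/data/testpostgres.py',
    conn_id='spark_kcluster',
    task_id='spark_job_test',
    packages=kafka_deps,
    dag=dag,
)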