Tags: apache-spark, pyspark, datastax, datastax-enterprise, datastax-startup

Why does this Spark 1.6.2 RPC error occur?


My script is written in Python and was working well on DSE 4.8 without a Docker environment. I have now upgraded to DSE 5.0.4 and run it in a Docker environment, and I get the RPC error below. The DSE Spark version was 1.4.1 before; now it is 1.6.2.

The host OS is CentOS 7.2 and the Docker image uses the same OS. We use Spark to submit a task, and we tried giving the executors 2G, 4G, 6G and 8G of memory; all of them produce the same error message.
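
For reference, a minimal sketch of how the executor memory is passed when the context is created; the app name and the concrete "8g" value are illustrative assumptions, not taken from the original script:

# Minimal PySpark 1.6 sketch of how the executor memory was varied (2g/4g/6g/8g).
# The app name and the "8g" value below are illustrative assumptions.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("user_profile_step1")      # hypothetical app name
        .set("spark.executor.memory", "8g"))   # also tried 2g, 4g and 6g
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)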

The same Python script ran without issues in my previous environment, but since the upgrade it no longer works properly.

The Scala operations run fine in the current environment; only the Python part has the issue. Resetting the hosts has not resolved it, and recreating the Docker container did not help either.

EDIT:

Maybe my map/reduce function is too complicated. The issue might be there, but I am not sure.

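To illustrate what is meant by the map/reduce function, here is a purely hypothetical sketch of that kind of RDD chain; it is not the actual compute_each_dimension_and_format_user logic, and the names and sample data are placeholders:

# Purely hypothetical illustration of the kind of map/reduce chain meant above;
# this is not the poster's actual logic, and the sample data is made up.
from pyspark import SparkContext

sc = SparkContext(appName="mapreduce_sketch")  # hypothetical app name

articles = sc.parallelize([("u1", "sports"), ("u1", "news"), ("u2", "sports")])

# Wide transformations such as groupByKey() buffer per-key state on the
# executors, which is one way tasks can exceed their memory limits.
article_by_top_all_tmp = (articles
    .map(lambda kv: ((kv[0], kv[1]), 1))            # key by (user, dimension)
    .reduceByKey(lambda a, b: a + b)                # count per (user, dimension)
    .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))  # re-key by user
    .groupByKey()
    .mapValues(list))

print(article_by_top_all_tmp.collect())
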
Environment specs: the cluster consists of 6 hosts; each host has a 16-core CPU, 32G of memory and a 500G SSD.

Any idea how to fix this issue? Also, what does this error message mean? Many thanks! Let me know if you need more info.

Error log:

Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
WARN  2017-02-26 10:14:08,314 org.apache.spark.scheduler.TaskSetManager: Lost task 47.1 in stage 88.0 (TID 9705, 139.196.190.79): TaskKilled (killed intentionally)
Traceback (most recent call last):
  File "/data/user_profile/User_profile_step1_classify_articles_common_sc_collect.py", line 1116, in <module>
    compute_each_dimension_and_format_user(article_by_top_all_tmp)
  File "/data/user_profile/User_profile_step1_classify_articles_common_sc_collect.py", line 752, in compute_each_dimension_and_format_user
    sqlContext.createDataFrame(article_up_save_rdd, df_schema).write.format('org.apache.spark.sql.cassandra').options(keyspace='archive', table='articles_up_update').save(mode='append')
  File "/opt/dse-5.0.4/resources/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 395, in save
WARN  2017-02-26 10:14:08,336 org.apache.spark.scheduler.TaskSetManager: Lost task 63.1 in stage 88.0 (TID 9704, 139.196.190.79): TaskKilled (killed intentionally)
  File "/opt/dse-5.0.4/resources/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/opt/dse-5.0.4/resources/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
  File "/opt/dse-5.0.4/resources/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o795.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 619 in stage 88.0 failed 4 times, most recent failure: Lost task 619.3 in stage 88.0 (TID 9746, 139.196.107.73): ExecutorLostFailure (executor 59 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$han

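The call that fails, as shown in the traceback, is the DataFrame write to Cassandra through the Spark Cassandra connector. A self-contained sketch of that step follows; the keyspace, table and format string come from the stack trace, while the schema and the sample RDD are hypothetical placeholders:

# Sketch of the failing step from the traceback. The keyspace, table and
# format string are taken from the stack trace; the schema fields and the
# sample RDD below are hypothetical placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(appName="user_profile_step1")   # hypothetical app name
sqlContext = SQLContext(sc)

df_schema = StructType([                          # hypothetical schema
    StructField("user_id", StringType(), False),
    StructField("dimension", StringType(), True),
])
article_up_save_rdd = sc.parallelize([("u1", "sports")])  # placeholder data

sqlContext.createDataFrame(article_up_save_rdd, df_schema) \
    .write.format('org.apache.spark.sql.cassandra') \
    .options(keyspace='archive', table='articles_up_update') \
    .save(mode='append')
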
Docker command:

docker run -d --net=host -i --privileged \
    -e SEEDS=10.XX.XXx.XX1,10.XX.XXx.XXX \
    -e CLUSTER_NAME="MyCluster" \
    -e LISTEN_ADDRESS=10.XX.XXx.XX \
    -e BROADCAST_RPC_ADDRESS=139.XXX.XXX.XXX \
    -e RPC_ADDRESS=0.0.0.0 \
    -e STOMP_INTERFACE=10.XX.XXx.XX \
    -e HOSTS=139.XX.XXx.XX \
    -v /data/dse/lib/cassandra:/var/lib/cassandra \
    -v /data/dse/lib/spark:/var/lib/spark \
    -v /data/dse/log/cassandra:/var/log/cassandra \
    -v /data/dse/log/spark:/var/log/spark \
    -v /data/agent/log:/opt/datastax-agent/log \
    --name dse_container registry..xxx.com/rechao/dse:5.0.4 -s

Solution

  • Docker itself is fine; increasing the host memory to 64G fixed this issue.