Search code examples
javaapache-sparkpysparkapache-zeppelin

Zeppelin Error When Running Pyspark Script


I am attempting to run AWS glue jobs using a development endpoint and am running into this error: 

org.apache.thrift.transport.TTransportException

Only thing I have done out of the box is specified here: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-local-notebook.html

Start command: 

$sudo bash
$./zeppelin-daemon.sh start

java version: openjdk version "1.8.0_222" zeppelin version: Version 0.7.3 MacOS Version: 10.14.6

Code: 

%pyspark
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

error log: 

INFO [2019-09-24 11:12:26,138] ({pool-2-thread-2} SchedulerFactory.java[jobStarted]:131) - Job paragraph_1569298185348_-1564147927 started by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpreterexisting_process458983990
INFO [2019-09-24 11:12:26,138] ({pool-2-thread-2} Paragraph.java[jobRun]:362) - run paragraph 20190924-000945_1425987694 using pyspark org.apache.zeppelin.interpreter.LazyOpenInterpreter@67a01c6d
ERROR [2019-09-24 11:12:26,154] ({pool-2-thread-2} Job.java[run]:188) - Job failed
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:401)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:97)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:406)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:266)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:250)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:373)
... 11 more
ERROR [2019-09-24 11:12:26,239] ({pool-2-thread-2} RemoteScheduler.java[getStatus]:281) - Unknown status
java.lang.IllegalArgumentException: No enum constant org.apache.zeppelin.scheduler.Job.Status.UNKNOWN
at java.lang.Enum.valueOf(Enum.java:238)
at org.apache.zeppelin.scheduler.Job$Status.valueOf(Job.java:51)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.getStatus(RemoteScheduler.java:271)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:342)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
ERROR [2019-09-24 11:12:26,240] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2056) - Error
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:401)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:97)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:406)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:266)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:250)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:373)
... 11 more
WARN [2019-09-24 11:12:26,240] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2064) - Job 20190924-000945_1425987694 is finished, status: ERROR, exception: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException, result: org.apache.thrift.transport.TTransportException
INFO [2019-09-24 11:12:26,263] ({pool-2-thread-2} SchedulerFactory.java[jobFinished]:137) - Job paragraph_1569298185348_-1564147927 finished by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpreterexisting_process458983990

Spark interpreter config: 

https://i.sstatic.net/K92p8.jpg

What has not worked:

  • Reinstalling zeppelin
  • Restarting zeppelin
  • Installing apache spark and setting SPARK_HOME to /usr/local/Cellar/apache-spark/2.4.4/
  • Adding yarn.. parameters to the spark interpreter config

Solution

  • Turns out, this was an issue on the AWS Glue team's end. This issue should be resolved.