I am trying to build a quick report in a Zeppelin notebook that fetches data from DynamoDB with Apache Spark.
The count runs fine, but beyond that I cannot run anything. For example,
orders.take(1).foreach(println)
fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 5.0 (TID 5) had a not serializable result: org.apache.hadoop.io.Text
Serialization stack:
- object not serializable (class: org.apache.hadoop.io.Text, value: )
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (,{<<A rec from DynamoDB as JSON>>}))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 7)
How can I fix this? I have tried to typecast the results, but that failed:
asInstanceOf[Tuple2[Text, DynamoDBItemWritable]]
So did the filter:
orders.filter(_._1 != null)
I am planning to convert this to a DataFrame and register it as a temp table, then run ad hoc queries on it.
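Map each record to its item before collecting, so that the Text key is dropped on the executors and never has to be sent to the driver:
orders.map(t => t._2.getItem()).collect.foreach(println)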
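Some background on why count works but take(1) does not: count only sends a number back to the driver, while take and collect send the records themselves, and those have to be serialized first. The stack trace shows that the tuple's first field is an org.apache.hadoop.io.Text, which is not serializable. The sketch below reproduces the same situation with plain Java serialization and no Spark at all. Note that FakeText, SerializationSketch and isJavaSerializable are hypothetical names made up for this illustration; FakeText only stands in for the Hadoop Text type, and the record content is invented.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for org.apache.hadoop.io.Text: it wraps a string but,
// like Text, does not implement java.io.Serializable.
class FakeText(val value: String) {
  override def toString: String = value
}

object SerializationSketch {
  // Returns true if Java serialization accepts the object,
  // false if it throws NotSerializableException.
  def isJavaSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  // Shaped like the RDD element in the question: (key, payload).
  // The payload here is an invented placeholder string.
  val record: (FakeText, String) = (new FakeText(""), """{"id": "1"}""")

  // Converting the key to a plain String first makes the whole tuple serializable.
  val converted: (String, String) = (record._1.toString, record._2)
}
```

Here isJavaSerializable(record) fails because of the FakeText field, while isJavaSerializable(converted) succeeds. The same idea applies to the RDD: convert or drop the Hadoop types inside a map, and only then call take or collect.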
This project can read from DynamoDB and create an RDD/DataFrame out of it: https://github.com/traviscrawford/spark-dynamodb