Tags: scala, amazon-web-services, apache-spark, amazon-dynamodb, apache-zeppelin

Error in reading DynamoDB record from Spark


I am trying to build a quick report in a Zeppelin notebook, fetching data from DynamoDB with Apache Spark.

The count runs fine, but beyond that I am not able to run anything. Even something as simple as

orders.take(1).foreach(println)

fails with the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 5.0 (TID 5) had a not serializable result: org.apache.hadoop.io.Text
Serialization stack:
- object not serializable (class: org.apache.hadoop.io.Text, value: )
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (,{<<A rec from DynamoDB as JSON>>}))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 7)

How do I fix this? I tried typecasting the result, but that failed:

 asInstanceOf[Tuple2[Text, DynamoDBItemWritable]]

and so did filtering out null keys:

 orders.filter(_._1 != null)

I am planning to convert this to a DataFrame and register it as a temp table, then run ad hoc queries on it.
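For reference, the end state I have in mind is roughly the following, where ordersDF is a placeholder for whatever DataFrame I manage to build (registerTempTable is the Spark 1.x API; in Spark 2.x it is createOrReplaceTempView):

 ordersDF.registerTempTable("orders")
 sqlContext.sql("SELECT COUNT(*) FROM orders").show()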


Solution

  • orders.map(t => t._2.getItem()).collect.foreach(println)

    This works because org.apache.hadoop.io.Text does not implement java.io.Serializable, so collecting the raw (Text, DynamoDBItemWritable) tuples to the driver fails. Mapping to getItem() on the executors first extracts each item's plain attribute map, which is serializable, so the Hadoop writables never have to be shipped back.

    Alternatively, this project can read DynamoDB tables and create an RDD/DataFrame from them: https://github.com/traviscrawford/spark-dynamodb
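    Building on that, here is a minimal sketch of going from the RDD to a queryable temp table. It assumes the items carry string attributes named orderId and status (hypothetical names; adjust to your table's schema):

     import com.amazonaws.services.dynamodbv2.model.AttributeValue
     import sqlContext.implicits._

     case class Order(orderId: String, status: String)

     // getItem() yields a java.util.Map[String, AttributeValue]; extract
     // plain serializable fields on the executors before collecting.
     val ordersDF = orders.map { case (_, item) =>
       val attrs: java.util.Map[String, AttributeValue] = item.getItem()
       Order(attrs.get("orderId").getS, attrs.get("status").getS)
     }.toDF()

     ordersDF.registerTempTable("orders")

    After that, ad hoc SQL like the queries sketched in the question runs against the "orders" temp table.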
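    If you go the library route instead, its README (as of this writing) sketches usage along these lines; treat the exact API as an assumption and verify it against the project's docs:

     import com.github.traviscrawford.spark.dynamodb._

     // Reads the table through the library's DataFrame reader extension;
     // the table name is yours to fill in.
     val ordersDF = sqlContext.read.dynamodb("Orders")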