Tags: scala, apache-spark, cassandra, spark-streaming, spark-cassandra-connector

Scala Spark Filter RDD using Cassandra


I am new to Spark, Cassandra, and Scala. I have an existing RDD of the form:

((url_hash, url, created_timestamp))

I want to filter this RDD based on url_hash: if a url_hash already exists in the Cassandra table, I want to filter it out of the RDD so that I only process the new URLs.

The Cassandra table looks like the following:

 url_hash| url | created_timestamp | updated_timestamp

Any pointers will be great.

I tried something like this:

   import java.util.Date

   case class UrlInfoT(url_sha256: String, full_url: String, created_ts: Date)
   def timestamp = new Date()
   val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
   val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
   val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
   val newUrlsRDD = rdd1.subtractByKey(rdd3)

I am getting a Cassandra error:

java.lang.NullPointerException: Unexpected null value of column full_url in keyspace.url_info. If you want to receive null values from Cassandra, please wrap the column type into Option or use JavaBeanColumnMapper

There are no null values in the Cassandra table.
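As the error message suggests, wrapping a column type in `Option` lets the row mapper absorb a null instead of throwing: `Option(x)` is `None` when `x` is null and `Some(x)` otherwise. A minimal plain-Scala sketch of that behavior (the URL value is made up for illustration):

```scala
// Option(x) yields None when x is null, Some(x) otherwise -- this is
// why the connector asks for Option-wrapped column types when a row
// value may come back null.
val missing: Option[String] = Option(null: String)
val present: Option[String] = Option("http://example.com")

assert(missing.isEmpty)
assert(present.contains("http://example.com"))
```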


Solution

  • Thanks, The Archetypal Paul!

    I hope somebody finds this useful. I had to add Option to the case class.

    Looking forward to better solutions.

    import java.util.Date

    case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])

    def timestamp = new Date()
    val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
    val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
    val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
    val newUrlsRDD = rdd1.subtractByKey(rdd3)
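For intuition, `subtractByKey` keeps only the pairs whose key is absent from the other RDD; the values on either side are irrelevant to the match. A plain-Scala sketch of the same semantics over in-memory maps (the hashes and URLs here are made-up values, not from the actual table):

```scala
// In-memory analogue of rdd1.subtractByKey(rdd3): keep pairs whose
// key does NOT appear among the keys of the other collection.
val incoming = Map(
  "hashA" -> ("http://a.example", 1L),
  "hashB" -> ("http://b.example", 2L)
)
val existing = Map("hashA" -> ("http://a.example", 0L)) // already stored

val newUrls = incoming -- existing.keys
// newUrls now contains only the "hashB" entry
```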