
Cassandra Spark Performing math operation on column value and saving


I'm working on a Cassandra Spark job where I need to find specific users who meet a certain condition, perform a math operation on a specific column, and then save the result back to Cassandra.

For example, given the following dataset, I would like to perform a math operation on the age when certain conditions are met.

Keyspace: test_users Table: member

CREATE TABLE test_users.member (
    member_id bigint PRIMARY KEY,
    manually_entered boolean,
    member_age decimal,
    member_name text
)
 member_id | manually_entered | member_age | member_name
-----------+------------------+------------+------------------
         2 |            False |     25.544 |      Larry Smith
         3 |            False |    38.3214 |  Karen Dinglebop
         7 |             True |         10 |    Howard Jibble
         9 |             True |         10 |   Whitney Howard
         4 |             True |         60 |     Walter White
        10 |             True |         10 | Kevin Schmoggins
         8 |            False |     10.234 |     Brett Darrel
         5 |            False |      19.22 |    Kenny Loggins
         6 |             True |         10 |         Joe Dirt
         1 |            False |     56.232 |       Joe Schmoe

I'm trying to figure out how to use the column value in a math expression with org.apache.spark.sql.functions.round():

spark-shell  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.expressions.Window
import spark.implicits._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.joda.time.LocalDate
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.{round}
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SQLContext


val members = spark.
  read.
  format("org.apache.spark.sql.cassandra").
  options(Map( "keyspace" -> "test_users", "table" -> "member" )).
  load()

var member_birthdays = members.select("member_id", "manually_entered", "member_age").
  where("manually_entered = false and member_age % 1 <> 0").
  withColumn("member_age", round(members['member_age'] * 5)) 

member_birthdays.write.
  format("org.apache.spark.sql.cassandra").
  mode("Append").
  options(Map( "keyspace" -> "test_users", "table" -> "member")).
  save()

I'm having trouble figuring out how to perform a math operation with round() and update a specific column where a condition is met, using Spark with Cassandra.

Any insight would be greatly appreciated.


Solution

  • I updated the import to org.apache.spark.sql.functions._ and used col("member_age") instead of members['member_age']. I was then able to update the column values and save successfully.

    import org.apache.spark.sql.functions._
    
    var member_birthdays = members.select("member_id", "manually_entered", "member_age").
      where("manually_entered = false and member_age % 1 <> 0").
      withColumn("member_age", round(col("member_age") * 5))
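
The per-row semantics of that filter and transformation can be sketched in plain Scala, without Spark (a minimal illustration only; it assumes round() on a decimal column behaves like HALF_UP rounding to zero decimal places, and uses a hypothetical Member case class standing in for the table rows):

```scala
// Hypothetical stand-in for a row of test_users.member
case class Member(memberId: Long, manuallyEntered: Boolean, memberAge: BigDecimal)

val members = Seq(
  Member(2, manuallyEntered = false, BigDecimal("25.544")),
  Member(7, manuallyEntered = true,  BigDecimal("10")),
  Member(5, manuallyEntered = false, BigDecimal("19.22"))
)

// Mirrors: where("manually_entered = false and member_age % 1 <> 0")
//          withColumn("member_age", round(col("member_age") * 5))
val updated = members
  .filter(m => !m.manuallyEntered && m.memberAge % 1 != 0)    // keep non-integer ages only
  .map(m => m.copy(memberAge =
    (m.memberAge * 5).setScale(0, BigDecimal.RoundingMode.HALF_UP)))  // multiply, then round
```

Member 7 is dropped because its age (10) has no fractional part; member 2's age becomes 25.544 * 5 = 127.72, rounded to 128.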