I'm working on a Cassandra Spark job where I need to find users who meet a certain condition, perform a math operation on a specific column, and then save the result back to Cassandra.
For example, given the following dataset, I would like to perform a math operation on the age column when certain conditions are met.
Keyspace: test_users Table: member
CREATE TABLE test_users.member (
member_id bigint PRIMARY KEY,
manually_entered boolean,
member_age decimal,
member_name text
)
member_id | manually_entered | member_age | member_name
-----------+------------------+------------+------------------
2 | False | 25.544 | Larry Smith
3 | False | 38.3214 | Karen Dinglebop
7 | True | 10 | Howard Jibble
9 | True | 10 | Whitney Howard
4 | True | 60 | Walter White
10 | True | 10 | Kevin Schmoggins
8 | False | 10.234 | Brett Darrel
5 | False | 19.22 | Kenny Loggins
6 | True | 10 | Joe Dirt
1 | False | 56.232 | Joe Schmoe
I'm trying to figure out how to reference a column's value inside a math expression using round() from org.apache.spark.sql.functions.
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.expressions.Window
import spark.implicits._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.joda.time.LocalDate
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.{round}
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SQLContext
val members = spark.
read.
format("org.apache.spark.sql.cassandra").
options(Map( "table" -> "member", "keyspace" -> "test_users" )).
load()
var member_birthdays = members.select("member_id", "manually_entered", "member_age").
where("manually_entered = false and member_age % 1 <> 0").
withColumn("member_age", round(members['member_age'] * 5))
member_birthdays.write.
format("org.apache.spark.sql.cassandra").
mode("Append").
options(Map( "table" -> "member", "keyspace" -> "test_users")).
save()
I'm having trouble figuring out how to perform the math operation with round() and update a specific field where a condition is met, using the Spark Cassandra connector.
Any insight would be greatly appreciated.
I updated the import to org.apache.spark.sql.functions._ and used col("member_age")
instead of members['member_age']. I was then able to update the column values and save successfully.
import org.apache.spark.sql.functions._
var member_birthdays = members.select("member_id", "manually_entered", "member_age").
where("manually_entered = false and member_age % 1 <> 0").
withColumn("member_age", round(col("member_age") * 5))
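Putting the pieces together, the full job might look like the sketch below. This assumes the same spark-shell session as above (so `spark` is already defined) and the keyspace/table names from the schema; Spark's round() with no scale argument rounds to the nearest whole number.

```scala
import org.apache.spark.sql.functions.{col, round}

// Read the member table from the test_users keyspace.
val members = spark.
  read.
  format("org.apache.spark.sql.cassandra").
  options(Map( "table" -> "member", "keyspace" -> "test_users" )).
  load()

// Keep only rows that were not manually entered and have a fractional age,
// then multiply member_age by 5 and round the result.
val member_birthdays = members.
  select("member_id", "manually_entered", "member_age").
  where("manually_entered = false and member_age % 1 <> 0").
  withColumn("member_age", round(col("member_age") * 5))

// Writing back in Append mode upserts on the primary key (member_id),
// overwriting member_age for the matching rows.
member_birthdays.write.
  format("org.apache.spark.sql.cassandra").
  mode("Append").
  options(Map( "table" -> "member", "keyspace" -> "test_users")).
  save()
```

For example, member_id 2 has member_age 25.544, so 25.544 * 5 = 127.72, which rounds to 128.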