Search code examples
scalaapache-sparkapache-spark-sqldatastax-enterprisecassandra-2.1

Count of List values in spark - dataframe


In cassandra I have a list column type. I am new to spark and scala, and have no idea where to start. In spark I want get count of each values, is it possible to do so. Below is the dataframe

+--------------------+------------+
|                  id|        data|
+--------------------+------------+
|53e5c3b0-8c83-11e...|      [b, c]|
|508c1160-8c83-11e...|      [a, b]|
|4d16c0c0-8c83-11e...|   [a, b, c]|
|5774dde0-8c83-11e...|[a, b, c, d]|
+--------------------+------------+

I want output as

+--------------------+------------+
|   value            |      count |
+--------------------+------------+
|a                   |      3     |
|b                   |      4     |
|c                   |      3     |
|d                   |      1     |
+--------------------+------------+

spark version: 1.4


Solution

  • Here you go :

    scala> val rdd = sc.parallelize(
      Seq(
        ("53e5c3b0-8c83-11e", Array("b", "c")),
        ("53e5c3b0-8c83-11e1", Array("a", "b")),
        ("53e5c3b0-8c83-11e2", Array("a", "b", "c")),
        ("53e5c3b0-8c83-11e3", Array("a", "b", "c", "d"))))
    // rdd: org.apache.spark.rdd.RDD[(String, Array[String])] = ParallelCollectionRDD[22] at parallelize at <console>:27
    
    scala> rdd.flatMap(_._2).map((_, 1)).reduceByKey(_ + _)
    // res11: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[21] at reduceByKey at <console>:30
    
    scala> rdd.flatMap(_._2).map((_,1)).reduceByKey(_ + _).collect
    // res16: Array[(String, Int)] = Array((a,3), (b,4), (c,3), (d,1))
    

    This is also actually quite easy with the DataFrame API :

    scala> val df = rdd.toDF("id", "data")
    // res12: org.apache.spark.sql.DataFrame = ["id": string, "data": array<string>]
    
    scala> df.select(explode($"data").as("value")).groupBy("value").count.show
    // +-----+-----+
    // |value|count|
    // +-----+-----+
    // |    d|    1|
    // |    c|    3|
    // |    b|    4|
    // |    a|    3|
    // +-----+-----+