Tags: scala, apache-spark, apache-spark-dataset, apache-spark-2.0

Scala Spark groupBy/Agg functions


I have two datasets that I need to join and perform operations on, and I can't figure out how to do it. One stipulation is that I don't have the org.apache.spark.sql.functions methods available to me, so I must use the Dataset API.

The input is two Datasets. The first is of type Customer, with the fields customerId, forename, and surname, all Strings.

The second is of type Transaction, with the fields customerId (String), accountId (String), and amount (Long).

customerId is the join key.

The output Dataset needs to have these fields: customerId (String), forename (String), surname (String), transactions (a list of type Transaction), transactionCount (Int), totalTransactionAmount (Double), averageTransactionAmount (Double).
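
For reference, the inputs and the required output could be modeled with case classes like these (the field names and types are taken from the lists above; CustomerSummary is just an assumed name for the output type):

    case class Customer(customerId: String, forename: String, surname: String)

    case class Transaction(customerId: String, accountId: String, amount: Long)

    case class CustomerSummary(
      customerId: String,
      forename: String,
      surname: String,
      transactions: Seq[Transaction],
      transactionCount: Int,
      totalTransactionAmount: Double,
      averageTransactionAmount: Double
    )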

I understand that I need to use groupBy, agg, and some kind of join at the end. Can anyone help or point me in the right direction? Thanks.


Solution

  • It is hard to work with only the information given, but from what I understand you don't want to use the DataFrame functions and instead want to implement everything with the Dataset API. You could do it in the following way:

    1. Join the two datasets using joinWith; you can find an example here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins.html#joinWith. For instance, as sketched below.
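
       A minimal sketch, assuming Datasets of the case class types from the question (customers and transactions are placeholder names). Note that === comes from the Column API, so this stays within the stipulation of not using org.apache.spark.sql.functions:

         // Inner join on customerId; the result is a Dataset[(Customer, Transaction)]
         val joined = customers.joinWith(
           transactions,
           customers("customerId") === transactions("customerId"),
           "inner"
         )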

    2. Aggregating: I would use groupByKey followed by mapGroups, something like this:

     // Group the transactions by customer id and aggregate each group.
     // spark.implicits._ must be in scope to provide an Encoder for the tuple result.
     ds.groupByKey(_.customerId).mapGroups { case (key, iter) =>
       val list = iter.toList
       val totalTransactionAmount = list.map(_.amount).sum.toDouble
       val averageTransactionAmount = totalTransactionAmount / list.size
       (key, totalTransactionAmount, averageTransactionAmount)
     }
    

    Hopefully this gives you an idea of how you could solve your problem with the Dataset API; you can adapt the sketch to your exact output type.
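
    Putting the two steps together, here is a minimal end-to-end sketch that produces the output type described in the question. It assumes the case classes defined there and a SparkSession in scope; spark.implicits._ supplies the Encoders that groupByKey and mapGroups need:

      import org.apache.spark.sql.{Dataset, SparkSession}

      val spark = SparkSession.builder().appName("transactions").getOrCreate()
      import spark.implicits._

      def summarise(customers: Dataset[Customer],
                    transactions: Dataset[Transaction]): Dataset[CustomerSummary] =
        customers
          // Step 1: inner join on customerId, keeping both sides as typed objects
          .joinWith(transactions, customers("customerId") === transactions("customerId"), "inner")
          // Step 2: group the (Customer, Transaction) pairs by customer id
          .groupByKey { case (customer, _) => customer.customerId }
          // Build one summary row per customer
          .mapGroups { case (_, pairs) =>
            val list = pairs.toList
            val customer = list.head._1 // every pair in a group carries the same customer
            val txs = list.map(_._2)
            val total = txs.map(_.amount).sum.toDouble
            CustomerSummary(customer.customerId, customer.forename, customer.surname,
              txs, txs.size, total, total / txs.size)
          }

    Note that mapGroups materialises each customer's transactions in memory, which is fine unless a single customer can have a very large number of transactions.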