Tags: scala, apache-spark, sbt, livy

How to rewrite Spark Scala code to use it in Apache Livy


I rewrote this code:

import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "file:///root/spark/README.md"
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}

to this:

import org.apache.livy._
import org.apache.spark.sql.SparkSession

class Test extends Job[Int]{

  override def call(jc: JobContext): Int = {
  
    val spark = jc.sparkSession()

    val logFile = "file:///root/spark/README.md"
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    
    1 //Return value
  }
}

but when I compile it with sbt, `val spark` is not recognized correctly and I receive the error "value read is not a member of Nothing".

Also, after commenting out the Spark-related code, when I try to run the resulting JAR file using /batches I receive the error "java.lang.NoSuchMethodException: Test.main([Ljava.lang.String;)".

Can anybody show the correct way to rewrite this Spark Scala code?


Solution

  • There's no need to rewrite your Spark application in order to use Livy. Instead, you can use its REST interface to submit jobs on a cluster that has a running livy server, retrieve logs, get job state, etc.

    As a practical example, here are instructions to run your application on AWS.

    Setup:

    1. Use AWS EMR to create a Spark cluster that has Spark, Livy and any other preinstalled applications you need for your application.
    2. Upload your JAR to AWS S3.
    3. Make sure that the security group attached to your cluster has an inbound rule that whitelists your IP on port 8998 (Livy's port).
    4. Make sure that your cluster has access to S3 in order to fetch the JAR.
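
    For step 2, the JAR you upload is the one sbt builds with `sbt package`. A minimal `build.sbt` sketch (the project name and versions here are assumptions — match the Scala and Spark versions to your EMR release; Spark is marked `provided` because the cluster already supplies it):

    ```scala
    name := "simple-app"
    version := "0.1"
    scalaVersion := "2.12.18"

    // "provided": Spark is preinstalled on the EMR cluster,
    // so it must not be bundled into the application JAR.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0" % "provided"
    ```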

    Now you'll be able to issue a POST request using cURL (or any equivalent) to submit your application:

    curl -H "Content-Type: application/json" -X POST --data '{"className":"<your-package-name>.SimpleApp","file":"s3://<path-to-your-jar>"}' http://<cluster-domain-name>:8998/batches
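
    Once the batch is submitted, the same REST interface lets you track it: the POST response includes a batch `id`, which you can use with Livy's `/batches/{id}/state` and `/batches/{id}/log` endpoints. A sketch (the host, batch id, class name, and S3 path are placeholders — substitute your own):

    ```shell
    # Placeholder host; substitute your cluster's domain name.
    LIVY_URL="http://<cluster-domain-name>:8998"

    # Helper that builds the JSON payload for POST /batches.
    livy_batch_payload() {
      printf '{"className":"%s","file":"%s"}' "$1" "$2"
    }

    # Submit (same request as the cURL line above, using the helper):
    #   curl -s -H "Content-Type: application/json" -X POST \
    #     --data "$(livy_batch_payload <your-package-name>.SimpleApp s3://<path-to-your-jar>)" \
    #     "$LIVY_URL/batches"
    #
    # Poll the batch state (batch id 0 here) until it is "success" or "dead":
    #   curl -s "$LIVY_URL/batches/0/state"
    #
    # Retrieve the driver logs, which include the println output:
    #   curl -s "$LIVY_URL/batches/0/log"
    ```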
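
    If you do want to keep the programmatic Job-API rewrite, the "value read is not a member of Nothing" error is a type-inference issue: `JobContext.sparkSession()` is generic, so without an explicit type parameter Scala infers `Nothing`. A minimal sketch of the fix, assuming the same Livy Job API as in the question:

    ```scala
    import org.apache.livy.{Job, JobContext}
    import org.apache.spark.sql.SparkSession

    class Test extends Job[Int] {
      override def call(jc: JobContext): Int = {
        // sparkSession() is generic; spell out the type parameter so Scala
        // infers SparkSession instead of Nothing.
        val spark = jc.sparkSession[SparkSession]()

        val logFile = "file:///root/spark/README.md"
        val logData = spark.read.textFile(logFile).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println(s"Lines with a: $numAs, Lines with b: $numBs")
        1 // return value of the job
      }
    }
    ```

    Note that a `Job` like this is submitted through Livy's programmatic `LivyClient` (upload the JAR, then `client.submit(new Test)`), not via `/batches` — `/batches` runs spark-submit, which expects a `main` method, hence the `NoSuchMethodException` you saw.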