scalac problem1.scala -d problem1.jar
Error:
problem1.scala:3: error: object apache is not a member of package org
import org.apache.spark.SparkContext
Code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.log4j.{Logger, Level}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StructType, StructField, LongType, StringType}
//import org.apache.parquet.format.StringType

object problem1 {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.OFF)

    // Create conf object
    val conf = new SparkConf().setMaster("local[2]").setAppName("loadData")
    // Create spark context object
    val sc = new SparkContext(conf)
    val SQLContext = new SQLContext(sc)
    import SQLContext.implicits._

    // Read file and create RDD
    val table_schema = StructType(Seq(
      StructField("TransID", LongType, true),
      StructField("CustID", LongType, true),
      StructField("TransTotal", LongType, true),
      StructField("TransNumItems", LongType, true),
      StructField("TransDesc", StringType, true)
    ))

    val T = SQLContext.read
      .format("csv")
      .schema(table_schema)
      .option("header", "false")
      .option("nullValue", "NA")
      .option("delimiter", ",")
      .load(args(0))
    // T.show(5)

    val T1 = T.filter($"TransTotal" >= 200)
    // T1.show(5)

    val T2 = T1.groupBy("TransNumItems").agg(sum("TransTotal"), avg("TransTotal"),
      min("TransTotal"), max("TransTotal"))
    // T2.show(500)
    T2.show()

    val T3 = T1.groupBy("CustID").agg(count("TransID").as("number_of_transactions_T3"))
    // T3.show(50)

    val T4 = T.filter($"TransTotal" >= 600)
    // T4.show(5)

    val T5 = T4.groupBy("CustID").agg(count("TransID").as("number_of_transactions_T5"))
    // T5.show(50)

    val temp = T3.as("T3").join(T5.as("T5"), $"T3.CustID" === $"T5.CustID")
    // T6.show(5)
    // print(T6.count())

    val T6 = temp.where($"number_of_transactions_T5" * 5 < $"number_of_transactions_T3")
    // T6.show(5)
    T6.show()

    sc.stop
  }
}
Why not choose a Docker image with sbt?
Anyway, yes, you can certainly create a jar from the command line with plain scalac, without sbt. You need the dependency jars (spark-core, spark-catalyst, spark-sql, log4j, and maybe some others if needed) and have to specify the classpath manually:
scalac -cp /path/to/spark-core_2.13-3.3.1.jar:/path/to/spark-catalyst_2.13/3.3.1/spark-catalyst_2.13-3.3.1.jar:/path/to/spark-sql_2.13/3.3.1/spark-sql_2.13-3.3.1.jar:/path/to/log4j-1.2-api-2.17.2.jar -d problem1.jar problem1.scala
For example, on my machine the path/to points into the coursier cache:
scalac -cp /home/dmitin/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/spark/spark-core_2.13/3.3.1/spark-core_2.13-3.3.1.jar:/home/dmitin/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/spark/spark-catalyst_2.13/3.3.1/spark-catalyst_2.13-3.3.1.jar:/home/dmitin/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/spark/spark-sql_2.13/3.3.1/spark-sql_2.13-3.3.1.jar:/home/dmitin/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/logging/log4j/log4j-1.2-api/2.17.2/log4j-1.2-api-2.17.2.jar -d problem1.jar problem1.scala
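If you don't already have these jars on disk, the coursier CLI can download them and print a ready-to-use classpath. A minimal sketch (the cs invocation is my own suggestion, not part of the original setup; spark-sql pulls in spark-core and spark-catalyst transitively):

cs fetch --classpath org.apache.spark:spark-sql_2.13:3.3.1 org.apache.logging.log4j:log4j-1.2-api:2.17.2

scalac -cp "$(cs fetch --classpath org.apache.spark:spark-sql_2.13:3.3.1 org.apache.logging.log4j:log4j-1.2-api:2.17.2)" -d problem1.jar problem1.scala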
Alternatively, you can build a fat jar (for example with sbt assembly) containing all the dependencies (or even your application together with all the dependencies) and use it:
scalac -cp fat-jar.jar -d problem1.jar problem1.scala
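For reference, a minimal sbt setup for producing such a fat jar could look like the following sketch (the plugin and library versions here are assumptions; adjust them to your Scala and Spark versions):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.1")

// build.sbt
ThisBuild / scalaVersion := "2.13.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.3.1",
  "org.apache.spark" %% "spark-sql"  % "3.3.1"
)

Running sbt assembly then produces the fat jar under target/scala-2.13/.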
https://github.com/sbt/sbt-assembly
https://www.scala-sbt.org/1.x/docs/Sbt-Launcher.html
The sbt launcher helps to run an application in environments where only Java is installed.
See also: SBT gives java.lang.NullPointerException when trying to run spark
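Note that the compiled problem1.jar still needs Spark and the Scala library on its runtime classpath. With a local Spark installation, the simplest way to run it is spark-submit; a sketch, where the input path is only a placeholder:

spark-submit --class problem1 problem1.jar /path/to/input.csv

(The application already calls setMaster("local[2]"), so no --master flag is needed here.)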