I created a Spark Scala project to test XGBoost4J-Spark. The project builds successfully, but when I run the script I get this error:
Message: <console>:65: error: object dmlc is not a member of package org.apache.spark.ml
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
I saw similar questions, like this one, where adding mllib to the pom.xml file seemed to fix it. However, I already have that dependency and the error is still thrown. Please advise.
Scala code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val spark = SparkSession.builder().config("spark.jars", "target/accounts-by-state-1.0.jar").getOrCreate()

val schema = new StructType(Array(
  StructField("sepal length", DoubleType, true),
  StructField("sepal width", DoubleType, true),
  StructField("petal length", DoubleType, true),
  StructField("petal width", DoubleType, true),
  StructField("class", StringType, true)))
val rawInput = spark.read.schema(schema).csv("iris.data")

import org.apache.spark.ml.feature.StringIndexer
val stringIndexer = new StringIndexer().
  setInputCol("class").
  setOutputCol("classIndex").
  fit(rawInput)
val labelTransformed = stringIndexer.transform(rawInput).drop("class")

import org.apache.spark.ml.feature.VectorAssembler
val vectorAssembler = new VectorAssembler().
  setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
  setOutputCol("features")
val xgbInput = vectorAssembler.transform(labelTransformed).select("features", "classIndex")

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
val xgbParam = Map("eta" -> 0.1f,
  "missing" -> -999,
  "objective" -> "multi:softprob",
  "num_class" -> 3,
  "num_round" -> 100,
  "num_workers" -> 2)
val xgbClassifier = new XGBoostClassifier(xgbParam).
  setFeaturesCol("features").
  setLabelCol("classIndex")
pom.xml file contents:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.cloudera.training.devsh</groupId>
  <artifactId>accounts-by-state</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>"Accounts by State"</name>
  <properties>
    <hadoop.version>3.0.0</hadoop.version>
    <spark.version>2.4.0</spark.version>
    <scala.version>2.11.12</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <java.version>1.8</java.version>
  </properties>
  <repositories>
    <repository>
      <id>XGBoost4J Snapshot Repo</id>
      <name>XGBoost4J Snapshot Repo</name>
      <url>https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/</url>
    </repository>
    <repository>
      <id>apache-repo</id>
      <name>Apache Repository</name>
      <url>https://repository.apache.org/content/repositories/releases</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
    <repository>
      <id>cloudera-repo-releases</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Scala -->
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency> <!-- Core Spark -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency> <!-- Spark SQL -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency> <!-- Spark MLlib -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>runtime</scope>
    </dependency>
    <dependency> <!-- Hadoop -->
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>ml.dmlc</groupId>
      <artifactId>xgboost4j_${scala.binary.version}</artifactId>
      <version>1.1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>ml.dmlc</groupId>
      <artifactId>xgboost4j-spark_${scala.binary.version}</artifactId>
      <version>1.1.0-SNAPSHOT</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version>
      </plugin>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <args>
            <!-- work-around for https://issues.scala-lang.org/browse/SI-8358 -->
            <arg>-nobootcp</arg>
          </args>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
EDIT:
To explain in more detail, I created a file called XgBoost.scala at /spark-application/xgboost-project/src/main/scala/xgboost/XgBoost.scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types._
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier._
//{StringType, LongType, StructField, StructType, DoubleType}

object XgBoost {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: XGBOOST TEST")
      System.exit(1)
    }
    val stateCode = args(0)

    val spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val schema = new StructType(Array(
      StructField("sepal length", DoubleType, true),
      StructField("sepal width", DoubleType, true),
      StructField("petal length", DoubleType, true),
      StructField("petal width", DoubleType, true),
      StructField("class", StringType, true)))
    val rawInput = spark.read.schema(schema).csv("~/iris.data")

    val stringIndexer = new StringIndexer().
      setInputCol("class").
      setOutputCol("classIndex").
      fit(rawInput)
    val labelTransformed = stringIndexer.transform(rawInput).drop("class")

    val vectorAssembler = new VectorAssembler().
      setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
      setOutputCol("features")
    val xgbInput = vectorAssembler.transform(labelTransformed).select("features", "classIndex")

    val xgbParam = Map("eta" -> 0.1f,
      "missing" -> -999,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> 2)
    val xgbClassifier = new XGBoostClassifier(xgbParam).
      setFeaturesCol("features").
      setLabelCol("classIndex")

    spark.stop
  }
}
I then modified my pom.xml file as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.cloudera.xgboost</groupId>
  <artifactId>xgboost-project</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>"XGBoost Project"</name>
  <properties>
    <hadoop.version>3.0.0</hadoop.version>
    <spark.version>2.4.0</spark.version>
    <scala.version>2.11.12</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <java.version>1.8</java.version>
  </properties>
  <repositories>
    <repository>
      <id>XGBoost4J Snapshot Repo</id>
      <name>XGBoost4J Snapshot Repo</name>
      <url>https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/</url>
    </repository>
    <repository>
      <id>apache-repo</id>
      <name>Apache Repository</name>
      <url>https://repository.apache.org/content/repositories/releases</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
    <repository>
      <id>cloudera-repo-releases</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Scala -->
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency> <!-- Core Spark -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency> <!-- Spark SQL -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency> <!-- Spark MLlib -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency> <!-- Hadoop -->
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>ml.dmlc</groupId>
      <artifactId>xgboost4j_${scala.binary.version}</artifactId>
      <version>1.1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>ml.dmlc</groupId>
      <artifactId>xgboost4j-spark_${scala.binary.version}</artifactId>
      <version>1.1.0-SNAPSHOT</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>${app.main.class}</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.3.0</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass>${app.main.class}</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <args>
            <!-- work-around for https://issues.scala-lang.org/browse/SI-8358 -->
            <arg>-nobootcp</arg>
          </args>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
When I run mvn package in the xgboost-project directory, I get:
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for com.cloudera.xgboost:xgboost-project:jar:1.0
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-jar-plugin is missing. @ line 89, column 17
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] ----------------< com.cloudera.xgboost:xgboost-project >----------------
[INFO] Building "XGBoost Project" 1.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ xgboost-project ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /home/exercises/spark-application/xgboost-project/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ xgboost-project ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:compile (default) @ xgboost-project ---
[WARNING] Expected all dependencies to require Scala version: 2.11.12
[WARNING] com.cloudera.xgboost:xgboost-project:1.0 requires scala version: 2.11.12
[WARNING] com.twitter:chill_2.11:0.9.3 requires scala version: 2.11.12
[WARNING] org.apache.spark:spark-core_2.11:2.4.0 requires scala version: 2.11.12
[WARNING] org.json4s:json4s-jackson_2.11:3.5.3 requires scala version: 2.11.11
[WARNING] Multiple versions of scala libraries detected!
[INFO] /home/exercises/spark-application/xgboost-project/src/main/scala:-1: info: compiling
[INFO] Compiling 1 source files to /home/cdsw/exercises/spark-application/xgboost-project/target/classes at 1599664489439
[ERROR] /home/exercises/spark-application/xgboost-project/src/main/scala/xgboost/XgBoost.scala:48: error: not found: type XGBoostClassifier
[ERROR] val xgbClassifier = new XGBoostClassifier(xgbParam).
[ERROR] ^
[ERROR] one error found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:08 min
[INFO] Finished at: 2020-09-09T15:14:53Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on project xgboost-project: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
Could this be a problem with my pom.xml file? It's also complaining about multiple Scala versions now, but mainly it doesn't seem to recognize the XGBoostClassifier object.
You need to provide the XGBoost libraries when submitting the job. The easiest way to do that is to specify the Maven coordinates via the --packages flag of spark-submit, like this:

spark-submit --packages ml.dmlc:xgboost4j-spark_2.11:1.1.0-SNAPSHOT <other options>
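
Note that 1.1.0-SNAPSHOT is not published to Maven Central, so spark-submit will most likely also need the snapshot repository you already have in your pom.xml, passed via the --repositories flag. A sketch; the class name and the trailing argument are assumptions based on your XgBoost.scala, which declares no package and reads args(0):

spark-submit \
  --repositories https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/snapshot/ \
  --packages ml.dmlc:xgboost4j-spark_2.11:1.1.0-SNAPSHOT \
  --class XgBoost \
  target/xgboost-project-1.0.jar SOME_ARG   # SOME_ARG is a placeholder for args(0)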
Another possibility is to create a fat jar with all your dependencies, for example with the assembly or shade plugins, by adding the following to the plugins section of your pom.xml:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>3.2.0</version>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
and then use the generated ...jar-with-dependencies.jar file for execution.
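
For example, with the artifactId and version from your pom.xml, the assembly plugin produces target/xgboost-project-1.0-jar-with-dependencies.jar, so the submit command would look roughly like this (class name and argument again assumed, as above):

spark-submit --class XgBoost target/xgboost-project-1.0-jar-with-dependencies.jar SOME_ARG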
Update: it looks like this happens because of using import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier._ instead of import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier. Here is the compilable version of your code. One question: do you really need to use 1.1.0-SNAPSHOT instead of the stable 1.0.0?
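
For reference, a minimal sketch of the fixed fragment, using the same parameter map and column names as in your XgBoost.scala:

// import the XGBoostClassifier type itself, not the members of its companion object
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val xgbParam = Map("eta" -> 0.1f,
  "missing" -> -999,
  "objective" -> "multi:softprob",
  "num_class" -> 3,
  "num_round" -> 100,
  "num_workers" -> 2)

// with the type in scope, `new XGBoostClassifier(...)` now resolves
val xgbClassifier = new XGBoostClassifier(xgbParam).
  setFeaturesCol("features").
  setLabelCol("classIndex")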