I use Maven to manage my project, and I did add the spark-sql-kafka-0-10_2.11 dependency to the Maven dependencies.
Below is my pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
I package everything using:
mvn clean package
I submit my job locally by typing:
spark-submit --class ... <path to jar file> <arguments to run the main class>
But I get the following error:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
I know I can fix this problem by adding --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
after spark-submit.
But how can I modify my pom to avoid doing that? In my Maven repo, I can see that spark-sql-kafka-0-10_2.11-2.2.0.jar has been downloaded. So why do I need to add the dependency manually during spark-submit? I feel like there might be an error in my pom.xml, even though I use the assembly plugin to build my jar.
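For reference, the coordinate passed to --packages and the jar file name seen in the local repository are two views of the same artifact. Below is a minimal sketch of how a groupId:artifactId:version coordinate maps to a path under the repository root. It assumes the default Maven repository layout, and `jar_path_for` is a hypothetical helper name, not part of Maven or Spark.

```python
from pathlib import PurePosixPath


def jar_path_for(coordinate: str) -> PurePosixPath:
    """Map 'groupId:artifactId:version' to its path relative to the repository root.

    Hypothetical helper for illustration; assumes the default Maven layout
    and a plain jar artifact with no classifier.
    """
    group_id, artifact_id, version = coordinate.split(":")
    # Dots in the groupId become directory separators.
    return PurePosixPath(*group_id.split("."), artifact_id, version,
                         f"{artifact_id}-{version}.jar")


print(jar_path_for("org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0"))
```

Having the jar in the local repository only makes it available to the Maven build. At run time, spark-submit can load it only if it is inside the application jar or supplied separately, for example with --packages.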
Hope someone can help me out!
I finally solved my problem. I changed my pom.xml as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
Basically, I added a profiles section and added a scope to each dependency.
Then, instead of using mvn clean package,
I used mvn clean install -Ponline -DskipTests.
Surprisingly, everything works perfectly.
I am not quite clear on why this method works, but looking inside the jar files I can see that the jar created by mvn clean package includes lots of folders, while the jar from the other method includes only a few. Maybe there are conflicts between folders in the first jar. I don't know; I hope someone more experienced can explain this.
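One possible explanation, offered as an assumption rather than something confirmed from the two jars: Spark resolves short data source names such as kafka through META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files on the classpath. When several dependencies are unpacked into one jar, files with that same path can overwrite each other, so the Kafka entry may be lost. Below is a minimal sketch of how to check this. `registered_data_sources` is a hypothetical helper name.

```python
import zipfile

# Service file that Spark's data source lookup reads via java.util.ServiceLoader.
SERVICE_FILE = "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister"


def registered_data_sources(jar_path):
    """Return the provider class names listed in the jar's DataSourceRegister file.

    Hypothetical helper for illustration. Accepts a path or a file-like object
    and returns an empty list when the jar has no such file.
    """
    with zipfile.ZipFile(jar_path) as jar:
        if SERVICE_FILE not in jar.namelist():
            return []
        text = jar.read(SERVICE_FILE).decode("utf-8")

    providers = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if entry:  # skip blank lines
            providers.append(entry)
    return providers
```

Calling this on the jar built by mvn clean package and on spark-sql-kafka-0-10_2.11-2.2.0.jar from the Maven repo lets you compare the two lists. If the standalone Kafka jar lists org.apache.spark.sql.kafka010.KafkaSourceProvider and the assembled jar does not, that would support this explanation.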