Search code examples
mavenapache-kafkaspark-structured-streaming

Spark 2.2.0 Streaming Failed to find data source: kafka


I use maven to manage my project. And I do add

  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.2.0</version>

to the maven dependencies

Below is my pom.xml

  <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>***</groupId>
  <artifactId>***</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>

    <plugins>         
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.11</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>2.11.8</scalaVersion>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.0.2</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>

    <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.6</version>
        <groupId>org.apache.maven.plugins</groupId>
        <configuration>
            <appendAssemblyId>false</appendAssemblyId>
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
            <archive>
                <manifest>
                    <mainClass>***</mainClass>
                </manifest>
            </archive>
        </configuration>
        <executions>
            <execution>
                <id>make-assembly</id>
                <phase>package</phase>
                <goals>
                    <goal>single</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
    </plugins>

 </build>


  <dependencies>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>


    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>

    <dependency>
      <groupId>org.scalaj</groupId>
      <artifactId>scalaj-http_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>

  </dependencies>
</project>

I package everything using:

mvn clean package

I submit my job locally by typing:

spark-submit --class ... <path to jar file> <arguments to run the main class>

But I will get an error saying:Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html

I know I can fix this problem by adding --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 after the spark-submit.

But how can I modify my pom to advoid doing that? The thing is in my maven repo, I can see spark-sql-kafka-0-10_2.11-2.2.0.jar has been downloaded. Then why I need to add the dependency mannually during the spark submit? I feel like there might be some error in my pom.xml even though I use the assembly to build my jar.

Hope someone can help me out!


Solution

  • Finally I sloved my problem. I changed my pom.xml as follows:

     <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
    
      <groupId>***</groupId>
      <artifactId>***</artifactId>
      <version>1.0</version>
      <packaging>jar</packaging>
    
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
          <spark.version>2.2.0</spark.version>
      </properties>
    
        <dependencies>
    
      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.11</artifactId>
          <version>${spark.version}</version>
          <scope>${spark.scope}</scope>
      </dependency>
    
      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql_2.11</artifactId>
          <version>${spark.version}</version>
          <scope>${spark.scope}</scope>
      </dependency>
      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
          <version>${spark.version}</version>
      </dependency>
    
      </dependencies>
    
    <profiles>
        <profile>
            <id>default</id>
            <properties>
                <profile.id>dev</profile.id>
                <spark.scope>compile</spark.scope>
            </properties>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
        </profile>
        <profile>
            <id>test</id>
            <properties>
                <profile.id>test</profile.id>
                <spark.scope>provided</spark.scope>
            </properties>
        </profile>
        <profile>
            <id>online</id>
            <properties>
                <profile.id>online</profile.id>
                <spark.scope>provided</spark.scope>
            </properties>
        </profile>
    </profiles>
    
        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>2.0.2</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.scala-tools</groupId>
                    <artifactId>maven-scala-plugin</artifactId>
                    <version>2.11</version>
                    <executions>
                        <execution>
                            <goals>
                                <goal>compile</goal>
                                <goal>testCompile</goal>
                            </goals>
                        </execution>
                    </executions>
                    <configuration>
                        <scalaVersion>2.11.8</scalaVersion>
                    </configuration>
                </plugin>
                <plugin>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <configuration>
                        <appendAssemblyId>false</appendAssemblyId>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                        <archive>
                            <manifest>
                                <mainClass>***</mainClass>
                            </manifest>
                        </archive>
                    </configuration>
                    <executions>
                        <execution>
                            <id>make-assembly</id>
                            <phase>package</phase>
                            <goals>
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
    
        </build>
    
    </project>
    

    Basically I added a profiles section and add scope to each dependency. Then instead of using mvn clean package I used mvn clean install -Ponline -DskipTests. And suprisingly, everything works perfect.

    I am not quite clear about the details why this method work, but from the jar file I can see that the jar created by mvn clean package include lots of folders while the other method only includes a few. Maybe there are some conflict between folders in the first method. I don't know, hope some experienced people can explain this.