Tags: scala, maven, apache-spark, jar, orc

How to use orc-core-1.5.5 in Spark 2.3.3?


My code relies on orc-core-1.5.5 and needs to run in a Spark 2.3.3 environment, but Spark 2.3.3 ships only orc-core-1.4.4.

For reasons I can't change, "--jars" is not allowed in my case, so I tried the Maven Shade Plugin to bundle orc-core-1.5.5 into my final jar. But when I submit that jar to Spark 2.3.3, it still fails with java.lang.NoSuchMethodError: org.apache.orc.OrcFile$ReaderOptions.getUseUTCTimestamp()Z (a method that exists only in 1.5.5). It seems my app is not using the orc-core-1.5.5 classes inside my jar, but is instead resolving the method against the 1.4.4 classes in the Spark environment.

The shading part of my pom:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.2.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <shadedArtifactAttached>true</shadedArtifactAttached>
            <shadedClassifierName>cuda10</shadedClassifierName>
            <artifactSet>
              <includes>
                <include>org.apache.orc:orc-core:nohive</include>
              </includes>
            </artifactSet>
          </configuration>
        </execution>
      </executions>
    </plugin>

After building the jar, I dug into it and decompiled OrcFile.class; the method getUseUTCTimestamp() is there.

In what order are methods resolved at runtime? What can I do to use a method that exists only in orc-core-1.5.5 while running on Spark 2.3.3?

Update: following the answer, I added relocations to the configuration:

    <relocations>
      <relocation>
        <pattern>org.apache.orc</pattern>
        <shadedPattern>org.shaded.apache.orc</shadedPattern>
      </relocation>
    </relocations>

But I'm getting a new error:

java.lang.NoClassDefFoundError: Could not initialize class org.shaded.apache.orc.impl.SnappyCodec
    at org.shaded.apache.orc.impl.WriterImpl.createCodec(WriterImpl.java:244)
    at org.shaded.apache.orc.impl.OrcCodecPool.getCodec(OrcCodecPool.java:55)
    at org.shaded.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:606)
......

I can see org/shaded/apache/orc inside my jar.


Solution

  • You'll want to use the <relocation> directive of the Maven Shade Plugin. This changes the "location" of your dependency so that it doesn't conflict with the version Spark ships.

    The shade plugin effectively moves your dependency into a different package and rewrites the bytecode of the rest of your project to use the new fully qualified class names, so that it doesn't overlap with Spark's dependencies and both versions can exist simultaneously in the JVM.
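
  • Regarding the follow-up NoClassDefFoundError on org.shaded.apache.orc.impl.SnappyCodec: one possible cause (an assumption, not something confirmed above) is that orc-core's compression codecs depend on io.airlift:aircompressor, which the artifactSet in the question doesn't include, so the relocated SnappyCodec fails to initialize against whatever aircompressor version Spark 2.3.3 happens to provide. A sketch of a shade configuration that also bundles and relocates that dependency, under that assumption, could look like:

        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>3.2.1</version>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
              <configuration>
                <shadedArtifactAttached>true</shadedArtifactAttached>
                <shadedClassifierName>cuda10</shadedClassifierName>
                <artifactSet>
                  <includes>
                    <!-- the ORC artifact from the question -->
                    <include>org.apache.orc:orc-core:nohive</include>
                    <!-- assumption: ORC's codecs need aircompressor shaded alongside -->
                    <include>io.airlift:aircompressor</include>
                  </includes>
                </artifactSet>
                <relocations>
                  <relocation>
                    <pattern>org.apache.orc</pattern>
                    <shadedPattern>org.shaded.apache.orc</shadedPattern>
                  </relocation>
                  <!-- relocate the compression library so the shaded ORC uses its own copy -->
                  <relocation>
                    <pattern>io.airlift.compress</pattern>
                    <shadedPattern>org.shaded.io.airlift.compress</shadedPattern>
                  </relocation>
                </relocations>
              </configuration>
            </execution>
          </executions>
        </plugin>

    Decompiling the resulting jar again (as was done for OrcFile.class) should show whether the relocated org/shaded/io/airlift/compress classes made it in next to org/shaded/apache/orc; if they didn't, the include pattern is the first thing to re-check.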