Tags: hive, apache-flink, flink-sql

Why does flink-quickstart-scala suggest adding connector dependencies in the default scope, while the Flink Hive integration docs suggest the opposite?


Connector dependencies should be in default scope

This is what flink-quickstart-scala suggests:

        <!-- Add connector dependencies here. They must be in the default scope (compile). -->

        <!-- Example:
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        -->

This also aligns with the Flink project configuration docs:

We recommend packaging the application code and all its required dependencies into one jar-with-dependencies which we refer to as the application jar. The application jar can be submitted to an already running Flink cluster, or added to a Flink application container image.

Important: For Maven (and other build tools) to correctly package the dependencies into the application jar, these application dependencies must be specified in scope compile (unlike the core dependencies, which must be specified in scope provided).
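Put together, a minimal pom fragment following that advice might look like this (a sketch: the Scala streaming API and Kafka connector artifacts are used here purely as examples; adjust artifacts and versions to your setup):

    <!-- Core Flink dependency: provided, because the cluster already ships it. -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>

    <!-- Connector dependency: default (compile) scope, so it is packaged into the application jar. -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>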

Hive connector dependencies should be in provided scope

However, the Flink Hive integration docs suggest the opposite:

If you are building your own program, you need the following dependencies in your mvn file. It’s recommended not to include these dependencies in the resulting jar file. You’re supposed to add dependencies as stated above at runtime.
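Concretely, that recommendation amounts to declaring the Hive-related dependencies in provided scope (a sketch; the exact artifact list and versions come from the Hive integration page for your Flink and Hive versions):

    <!-- Flink's Hive connector and Hive itself: provided, because they are expected
         to be on the cluster classpath at runtime rather than inside the user jar. -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-hive_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>${hive.version}</version>
        <scope>provided</scope>
    </dependency>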

Why?


Solution

  • The reason for this difference is that for Hive it is recommended to start the cluster with the respective Hive dependencies already on its classpath. The documentation states that it's best to put these dependencies into the lib directory before you start the cluster. That way the cluster can run any job that uses Hive, and you don't have to bundle the dependencies into the user jar, which keeps it small. However, there shouldn't be anything preventing you from bundling the Hive dependency with your user code if you want to, as in the sketch below.
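If you do choose to bundle the Hive connector with your user code instead of placing it in the cluster's lib directory, the only change in the pom is the scope (a sketch; same hypothetical artifact as above, and whether this works smoothly also depends on your shading/packaging setup):

    <!-- Bundled variant: default (compile) scope, so the Hive connector is packaged
         into the application jar instead of being loaded from the cluster's lib directory. -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-hive_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>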