
The difference between, and benefit of, joinWithTiny, joinWithHuge and JoinHint


What is the difference between passing a JoinHint to join and using joinWithTiny or joinWithHuge?

With JoinHint we can use, for example, BROADCAST_HASH_FIRST (a hint that the first join input is much smaller than the second) or REPARTITION_HASH_FIRST (a hint that the first join input is a bit smaller than the second).

Meanwhile, we can also use joinWithHuge and joinWithTiny.

Are they the same? Does joinWithTiny use BROADCAST_HASH_FIRST?

Is the benefit of using them that the Flink job saves the time it would otherwise spend checking the size of the data being joined?


Solution

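  • Yes, both methods are shortcuts for join hints, but note which hint each one maps to. DataSet.joinWithTiny(DataSet other) is a shortcut for DataSet.join(DataSet other, JoinHint.BROADCAST_HASH_SECOND), because the tiny input is the second one (other). DataSet.joinWithHuge(DataSet other) is a shortcut for DataSet.join(DataSet other, JoinHint.BROADCAST_HASH_FIRST), because the first input is then the small one.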

    Apache Flink features a cost-based optimizer. Cost-based optimization requires estimating the input size of operators. This can be very difficult (or even impossible) in settings with user-defined functions, which are common in Flink programs. If Flink's optimizer cannot obtain meaningful size estimates, it falls back to robust and scalable execution strategies, such as repartitioning instead of broadcasting. Optimizer hints let the user specify exactly which join strategy to use. This can improve the performance of a program if the user knows relevant properties of the data being processed, such as the relative sizes of the join inputs.

    So optimizer hints are not about reducing the time needed to obtain estimates, but about giving the user full control over how a Flink program is executed.
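
To make the trade-off between broadcasting and repartitioning more concrete, here is a minimal sketch in plain Java. It is not Flink's actual cost model and does not use the Flink API. The class JoinShippingCost, its methods, and the local Strategy enum are hypothetical names introduced only for illustration, and the model makes the simplifying assumption that the only cost is the number of records shipped over the network.

```java
// Simplified illustration only: NOT Flink's optimizer or cost model.
// Assumption: cost = number of records shipped over the network.
class JoinShippingCost {

    // Hypothetical local enum; the names merely mirror Flink's JoinHint values.
    enum Strategy { BROADCAST_HASH_FIRST, BROADCAST_HASH_SECOND, REPARTITION }

    // Broadcasting sends a full copy of one input to every parallel task.
    static long broadcastCost(long broadcastSideRecords, int parallelism) {
        return broadcastSideRecords * parallelism;
    }

    // Repartitioning sends each record of both inputs over the network at most once.
    static long repartitionCost(long firstRecords, long secondRecords) {
        return firstRecords + secondRecords;
    }

    // Picks the strategy with the lowest shipping cost under this toy model.
    static Strategy cheapest(long firstRecords, long secondRecords, int parallelism) {
        long first = broadcastCost(firstRecords, parallelism);
        long second = broadcastCost(secondRecords, parallelism);
        long repartition = repartitionCost(firstRecords, secondRecords);
        if (repartition <= first && repartition <= second) {
            return Strategy.REPARTITION;
        }
        return first <= second ? Strategy.BROADCAST_HASH_FIRST
                               : Strategy.BROADCAST_HASH_SECOND;
    }

    public static void main(String[] args) {
        // Second input is tiny: broadcasting it is far cheaper than repartitioning.
        System.out.println(cheapest(1_000_000, 100, 8));
        // Inputs of similar size: repartitioning both is cheaper than broadcasting either.
        System.out.println(cheapest(1_000_000, 900_000, 8));
    }
}
```

Under this toy model, a tiny second input favors broadcasting it (the situation joinWithTiny is meant for), while inputs of similar size favor repartitioning. When the sizes are unknown, the model cannot tell which case applies, which is why a hint from a user who knows the data can help.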