apache-flink, flink-streaming

Flink: what is the behavior of minBy or maxBy when multiple records meet the condition?


I'm new to Flink and I'm wondering how minBy behaves (I assume maxBy is the same) when multiple records share the minimum value. I noticed that Flink outputs only one record in this case, but which one: the first, the last, or a random one?

Thanks for the help.


Solution

  • Note that as of FLIP-134, all of these relational methods on DataStreams, namely Windowed/KeyedStream#sum, min, max, minBy, and maxBy, are planned to be deprecated. The entire DataSet API is also planned to be deprecated eventually.

    The only long-term support for relational methods like these is what is provided by the Table and SQL APIs.

    But to answer your question, minBy and maxBy work the same way.

    The javadoc for DataSet#maxBy says

    If multiple values with maximum value at the specified fields exist, a random one will be picked.

    while the javadocs for AllWindowedStream#maxBy(int positionToMaxBy) and KeyedStream#maxBy(int positionToMaxBy) say

    If more elements have the same maximum value the operator returns the first by default.

    and the javadocs for AllWindowedStream#maxBy(int positionToMaxBy, boolean first) and KeyedStream#maxBy(int positionToMaxBy, boolean first) explain that

    If [first is] true, then the operator return the first element with the maximum value, otherwise returns the last
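
To make the tie-breaking rule from the streaming javadocs concrete, here is a minimal plain-Java sketch that simulates it. It is not Flink's implementation and does not use the Flink API: the class `TieBreakSketch`, the `Reading` type, and the sample values are all hypothetical, and the only thing taken from the javadocs is the meaning of the `first` flag. The idea is that a rolling aggregation keeps one current winner; replacing it only on a strictly greater value keeps the first maximum, while replacing it on an equal value too keeps the last.

```java
import java.util.List;

// Plain-Java simulation of the documented tie-breaking rule.
// Assumption: this mirrors the javadoc wording only; it is not Flink code.
final class TieBreakSketch {

    // Hypothetical record type: a label plus the value being compared.
    static final class Reading {
        final String label;
        final int value;

        Reading(String label, int value) {
            this.label = label;
            this.value = value;
        }
    }

    // Walks the elements in arrival order and keeps a single current winner.
    // first == true  : replace only on a strictly greater value (first maximum wins).
    // first == false : also replace on an equal value (last maximum wins).
    // Returns null for an empty list.
    static Reading maxBy(List<Reading> arrivals, boolean first) {
        Reading current = null;
        for (Reading r : arrivals) {
            if (current == null
                    || r.value > current.value
                    || (!first && r.value == current.value)) {
                current = r;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<Reading> arrivals = List.of(
                new Reading("a", 5),
                new Reading("b", 7),
                new Reading("c", 7),
                new Reading("d", 3));

        System.out.println(maxBy(arrivals, true).label);  // prints "b"
        System.out.println(maxBy(arrivals, false).label); // prints "c"
    }
}
```

Since minBy and maxBy work the same way, the minBy case would follow by flipping the comparison. In an actual Flink job, the corresponding choice is made through the `first` argument of the `maxBy(int positionToMaxBy, boolean first)` overloads quoted above.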