Tags: java, stream, apache-flink, flink-streaming

Flink - Does the keyBy() function partition the Flink operator internally?


In the code below, taken from the documentation:

package spendreport;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.walkthrough.common.sink.AlertSink;
import org.apache.flink.walkthrough.common.entity.Alert;
import org.apache.flink.walkthrough.common.entity.Transaction;
import org.apache.flink.walkthrough.common.source.TransactionSource;

public class FraudDetectionJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Transaction> transactions = env
            .addSource(new TransactionSource())
            .name("transactions");
        
        DataStream<Alert> alerts = transactions
            .keyBy(Transaction::getAccountId)
            .process(new FraudDetector())
            .name("fraud-detector");

        alerts
            .addSink(new AlertSink())
            .name("send-alerts");

        env.execute("Fraud Detection");
    }
}
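
For context, the FraudDetector class used above is not shown in the snippet; in the walkthrough it is a KeyedProcessFunction keyed by account ID. A minimal sketch, with the real detection logic replaced by a placeholder rule, could look like this:

import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.walkthrough.common.entity.Alert;
import org.apache.flink.walkthrough.common.entity.Transaction;

// Simplified sketch of a fraud detector: a KeyedProcessFunction keyed by
// account ID (Long), consuming Transactions and emitting Alerts.
// The detection rule below is only a placeholder, not the walkthrough's logic.
public class FraudDetector extends KeyedProcessFunction<Long, Transaction, Alert> {

    private static final double SMALL_AMOUNT = 1.00;

    @Override
    public void processElement(
            Transaction transaction,
            Context context,
            Collector<Alert> collector) {
        // processElement is invoked per record, with state and timers
        // automatically scoped to the current account ID.
        if (transaction.getAmount() < SMALL_AMOUNT) {
            Alert alert = new Alert();
            alert.setId(transaction.getAccountId());
            collector.collect(alert);
        }
    }
}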

1) Does .keyBy(Transaction::getAccountId), applied to the transactions data stream, partition the incoming data by account ID and run process(new FraudDetector()) on each partition, as shown in the diagram below?

2) If yes, how do I write a mapper to process the data streamed from each partition? By using .keyBy(Transaction::getAccountId)?

[diagram omitted]



Solution

  • keyBy is applied to the transactions DataStream.

    After applying keyBy, records from transactions with the same account ID end up in the same partition, and you can apply KeyedStream operations such as process (prefer the KeyedProcessFunction overload; the overload taking a plain ProcessFunction is deprecated), window, reduce, min/max/sum, etc. (see the sketch below).

    There are lots of examples of using keyBy, e.g. this link.
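
As a rough sketch of question 2 (class name and output string are illustrative, and the walkthrough's Transaction/TransactionSource types are assumed): any operator chained after keyBy, such as a plain map, runs on the partition that owns the current key, so all records for one account are seen by the same parallel instance.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.walkthrough.common.entity.Transaction;
import org.apache.flink.walkthrough.common.source.TransactionSource;

public class KeyedMapperExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Transaction> transactions = env
            .addSource(new TransactionSource())
            .name("transactions");

        DataStream<String> perAccount = transactions
            .keyBy(Transaction::getAccountId)        // partition by account ID
            .map(new MapFunction<Transaction, String>() {
                @Override
                public String map(Transaction t) {
                    // Runs on the parallel instance that owns this account's partition.
                    return "account " + t.getAccountId() + " spent " + t.getAmount();
                }
            })
            .name("per-account-mapper");

        perAccount.print();

        env.execute("Keyed Mapper Example");
    }
}

If you need per-key state or timers rather than a stateless mapping, a KeyedProcessFunction such as the FraudDetector sketched above is the better fit, since its state is automatically scoped to the current key.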