In the code below, taken from the Flink fraud-detection walkthrough documentation:
package spendreport;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.walkthrough.common.sink.AlertSink;
import org.apache.flink.walkthrough.common.entity.Alert;
import org.apache.flink.walkthrough.common.entity.Transaction;
import org.apache.flink.walkthrough.common.source.TransactionSource;

public class FraudDetectionJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Transaction> transactions = env
            .addSource(new TransactionSource())
            .name("transactions");

        DataStream<Alert> alerts = transactions
            .keyBy(Transaction::getAccountId)
            .process(new FraudDetector())
            .name("fraud-detector");

        alerts
            .addSink(new AlertSink())
            .name("send-alerts");

        env.execute("Fraud Detection");
    }
}
1) Does .keyBy(Transaction::getAccountId), applied to the data stream transactions, partition the incoming data by account ID and then run process(new FraudDetector()) on each partition, as shown in the diagram below?

2) If yes, how do I write a mapper that processes the data streamed from each partition produced by .keyBy(Transaction::getAccountId)?
keyBy is applied to the DataStream transactions. After applying keyBy, records from transactions with the same account ID will be in the same partition, and you can apply functions from KeyedStream, like process, window, reduce, min/max/sum, etc. (fold is also available but not recommended, as it is marked deprecated).
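To make the partitioning semantics concrete, here is a plain-Java sketch with no Flink dependency. The Transaction record and the hash-based routing are simplified stand-ins for what keyBy does internally (Flink actually routes via key groups, but the invariant is the same: equal keys always land in the same partition):

```java
import java.util.ArrayList;
import java.util.List;

public class KeyBySketch {

    // Simplified stand-in for the walkthrough's Transaction entity.
    record Transaction(long accountId, double amount) {}

    // Conceptual analogue of keyBy's routing: the partition is derived
    // from a hash of the key, so equal keys map to the same partition.
    static int partitionFor(long accountId, int parallelism) {
        return Math.floorMod(Long.hashCode(accountId), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4; // number of parallel "partitions" (subtasks)

        List<List<Transaction>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new ArrayList<>());
        }

        List<Transaction> stream = List.of(
            new Transaction(1L, 20.0),
            new Transaction(2L, 500.0),
            new Transaction(1L, 3.14),
            new Transaction(3L, 99.0));

        for (Transaction t : stream) {
            partitions.get(partitionFor(t.accountId(), parallelism)).add(t);
        }

        // Both records for account 1 end up in the same partition, so a
        // downstream operator sees that account's full history together.
        for (int i = 0; i < parallelism; i++) {
            System.out.println("partition " + i + ": " + partitions.get(i));
        }
    }
}
```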
There are lots of examples of using keyBy, e.g. this link.
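As for question 2: in Flink you would typically extend KeyedProcessFunction (as FraudDetector does) or apply map/RichMapFunction after keyBy, using per-key state such as ValueState. The plain-Java simulation below sketches that idea only conceptually; KeyedMapper and runningTotal are made-up names for illustration, not Flink API:

```java
import java.util.HashMap;
import java.util.Map;

public class PerKeyMapperSketch {

    // Conceptual analogue of Flink's per-key ValueState: each key
    // (account ID) sees only its own slot of state.
    static class KeyedMapper {
        private final Map<Long, Double> runningTotal = new HashMap<>();

        // Analogue of a mapper's processElement/map method: invoked once
        // per record, with access scoped to that record's key.
        double map(long accountId, double amount) {
            double total = runningTotal.getOrDefault(accountId, 0.0) + amount;
            runningTotal.put(accountId, total);
            return total;
        }
    }

    public static void main(String[] args) {
        KeyedMapper mapper = new KeyedMapper();
        System.out.println(mapper.map(1L, 20.0));  // 20.0
        System.out.println(mapper.map(2L, 500.0)); // 500.0
        System.out.println(mapper.map(1L, 3.0));   // 23.0: state is per key
    }
}
```

In a real job, Flink manages this state for you (checkpointed and rescalable); your function only declares a ValueState descriptor and reads/updates it per element.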