I've read some article, benchmarking the performance of stream processing engines like Spark streaming, Storm, and Flink. In the evaluation part, the criterion was 99th percentile and throughput. For example, Apache Kafka sent data at around 100.000 events per seconds and those three engines act as stream processor and their performance was described using 99th percentile latency and throughput.
Can anyone clarify these two criteria for me?
99th percentile latency of X milliseconds in stream jobs means that 99% of the items arrived at the end of the pipeline in less than X milliseconds. Read this reference for more details.
When application developers expect a certain latency, they often need a latency bound. We measure several latency bounds for the stream record grouping job which shuffles data over the network. The following figure shows the median latency observed, as well as the 90-th, 95-th, and 99-th percentiles (a 99-th percentile of latency of 50 milliseconds, for example, means that 99% of the elements arrive at the end of the pipeline in less than 50 milliseconds).