Search code examples
streamapache-storm

Unbalanced data received in Storm


I find there is some unbalanced data flow in my topology. The amount of data received from a stream is much less than that of another bolt which receives data from the same stream. Here is my topology:

        // Create 3 spouts for APMM APCC APSM respectively
        builder.setSpout(GlobalStorm.SPOUT_APMM, new UniversalApSpout(GlobalStorm.APMM));
        builder.setSpout(GlobalStorm.SPOUT_APCC, new UniversalApSpout(GlobalStorm.APCC));
        builder.setSpout(GlobalStorm.SPOUT_APSM, new UniversalApSpout(GlobalStorm.APSM));


        // Create moving point bolt connecting three streams above
        builder.setBolt(GlobalStorm.MovingPointMapBolt, new MovingPointMapBolt(), 9)
                .shuffleGrouping(GlobalStorm.SPOUT_APMM, GlobalStorm.STREAM_MM)
                .shuffleGrouping(GlobalStorm.SPOUT_APCC, GlobalStorm.STREAM_CC)
                .shuffleGrouping(GlobalStorm.SPOUT_APSM, GlobalStorm.STREAM_SM);

        // Real time bolt connecting APMM only
        builder.setBolt(GlobalStorm.RealTimeBolt, new RealTimeBolt(), 9).
                shuffleGrouping(GlobalStorm.SPOUT_APMM, GlobalStorm.STREAM_MM);

        // Redis bolt that saving data from moving point bolt and real time bolt together.
        builder.setBolt(GlobalStorm.RedisStoreBolt, new RedisStoreBolt(), 9)
                .shuffleGrouping(GlobalStorm.MovingPointMapBolt, GlobalStorm.STREAM_MOVING_POINT)
                .shuffleGrouping(GlobalStorm.RealTimeBolt, GlobalStorm.STREAM_REAL_TIME);

And here is Storm UI stats data: enter image description here SpoutApcc, SpoutApmm, SpoutApsm emit STREAM_APCC, STREAM_APMM, STREAM_APSM respectively. RealTimeBolt gets data from STREAM_APMM only and MovingPointMapBolt gets data from all three streams.

If everything is correct, The executed amount of RealTimeBolt should be equal to emitted amount of SpoutAPMM (or half of transferred amount of it). Even it may have machine performance problem, the proportion of execute number between MovingPointMapBolt and RealtimeBolt should be similar to the proportion of three streams's data amount.

However, the executed messages number of realtimebolt is much less than that of MovingpointmapBolt, which is only less than 1% of that.

So what is the reason of this problem?


Solution

  • After several days observing, I find the reason finally.

    The Stream named RealTime is stacked by the slow Redis io speed. In RealTimeBolt, there are some Redis IO operation which is required by business logic. The blocked messages impact the stream transferring before RedisBolt. On the contrary, MovingPointPoint part doesn't have component slowing down the stream speed, so it consumes almost all messages.