Tags: amazon-web-services, mqtt, mosquitto, aws-iot

Very slow delivery of queued stored messages on Eclipse Mosquitto


I am running Mosquitto as a Docker container, version 2.0.14 (image: eclipse-mosquitto:2.0.14). I am intentionally not running 2.0.15, as that version has a regression that affects us.

I have created a bridge to AWS, following the standard documentation provided by Amazon.

My config looks like this:

#  Bridged topics
topic root/topic/# out 1

# Setting protocol version explicitly
bridge_protocol_version mqttv311
bridge_insecure false

# Bridge connection name and MQTT client Id, enabling the connection automatically when the broker starts.
cleansession false
clientid bridgeawsiot
start_type automatic
notifications false
log_type all
restart_timeout 10 30
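
For reference, the snippet above omits the `connection` and `address` lines and the TLS certificate settings that a bridge to AWS IoT also needs; a hedged sketch of that missing part is below. The endpoint and certificate paths are placeholders, not values from the actual setup.

```
# Hypothetical remainder of the bridge section; replace the endpoint
# and certificate paths with your own AWS IoT values.
connection bridgeawsiot
address abc123-ats.iot.eu-west-1.amazonaws.com:8883

bridge_cafile /mosquitto/certs/AmazonRootCA1.pem
bridge_certfile /mosquitto/certs/device.pem.crt
bridge_keyfile /mosquitto/certs/private.pem.key
```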

I am testing Mosquitto's behaviour during network interruptions. We want to deploy this in open fields where we anticipate connectivity issues, with potentially prolonged periods of disconnection (several hours up to a couple of days).

We have also enabled message persistence; these are the relevant settings:

max_inflight_bytes 0
max_inflight_messages 0
max_queued_bytes 1073741824
max_queued_messages 100000
persistent_client_expiration 7d
listener 1883
autosave_interval 10
persistence true
persistence_file mosquitto.db
persistence_location /mqtt/data
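
Whether these limits can actually absorb a multi-day outage can be estimated with back-of-envelope arithmetic. The sketch below uses the limits from the config above; the ~4.7 KB payload size comes from the DB dump further down, and the 60 messages/minute rate is an assumption, not a measurement.

```python
# Back-of-envelope check: which limit (max_queued_messages or
# max_queued_bytes) is hit first during an outage, and roughly how
# many hours of disconnection the queue can absorb. The payload size
# and publish rate are assumptions, not values from the post.

def outage_capacity(max_msgs, max_bytes, payload_bytes, msgs_per_minute):
    """Return (binding_limit, hours_until_full) for the stored-message queue."""
    msgs_by_bytes = max_bytes // payload_bytes
    effective_msgs = min(max_msgs, msgs_by_bytes)
    binding = ("max_queued_messages" if max_msgs <= msgs_by_bytes
               else "max_queued_bytes")
    hours = effective_msgs / msgs_per_minute / 60
    return binding, hours

# Values from the config above; ~4.7 KB payloads as seen in the DB
# dump, and an assumed telemetry rate of 60 messages per minute.
binding, hours = outage_capacity(
    max_msgs=100_000,
    max_bytes=1_073_741_824,
    payload_bytes=4_706,
    msgs_per_minute=60,
)
print(binding, round(hours, 1))  # prints: max_queued_messages 27.8
```

Under these assumptions, max_queued_messages is the binding limit and fills after roughly 28 hours, which falls short of "a couple of days", so that setting may need to be raised.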

On the AWS side of things, we have MongoDB ingesting the data as time series. We use a stable, deterministic approach to collect telemetry, so the number of metrics per minute is constant. Here is a graph of what the data ingestion looks like:

[Histogram of samples per minute]

The queue in Mosquitto just seems to keep growing. It doesn't decrease once connectivity is re-established (I simulate disconnection by turning off my Wi-Fi). The numbers reported on the $SYS/broker/store/messages/count topic mostly increase. When I inspect the contents of mosquitto.db (link1, link2) I don't see much detail, but I can observe entries like this:

DB_CHUNK_MSG_STORE:
        Length: 4853
        Store ID: 59572
        Source Port: 1883
        Source MID: 7276
        Topic: some/topic/here
        QoS: 1
        Retain: 1
        Payload Length: 4706
        Expiry Time: 0
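
One way to turn the $SYS/broker/store/messages/count readings into a clear growing-vs-draining signal is to sample the topic periodically (e.g. with mosquitto_sub) and compare samples. A small sketch; the sampling itself is assumed to happen elsewhere, and the counts below are illustrative, not real data.

```python
# Given successive samples of $SYS/broker/store/messages/count,
# classify whether the stored-message queue is growing, draining,
# or flat over the sampled window.

def queue_trend(samples, tolerance=0):
    """Return 'growing', 'draining', or 'flat' from a list of count samples."""
    if len(samples) < 2:
        return "flat"
    delta = samples[-1] - samples[0]
    if delta > tolerance:
        return "growing"
    if delta < -tolerance:
        return "draining"
    return "flat"

# Counts as they might be read from the $SYS topic once a minute.
print(queue_trend([59000, 59120, 59572]))   # prints: growing
print(queue_trend([59572, 58200, 55000]))   # prints: draining
```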

I have observed that eventually some data does come through. The graphs suddenly start to fill up, but very slowly; after hours we may get some data points "from the past".

What I am wondering now is: is Mosquitto designed to handle long periods of disconnection? Are we using the right tool for the job? Or is it just a matter of us having configured it incorrectly? If so, can someone point us in a better direction?


Solution

  • I think I have narrowed down the issue. I am not 100% confident about this, but I have done a few trials and I think the root cause is a combination of factors.

    In short, it is not a problem on AWS IoT, and when well configured it is not an issue on Mosquitto either. The problem seems to be related to the publisher flagging events as retained = true.

    In addition, the reason I didn't notice this earlier seems to be that I also have the persistence DB option enabled (persistence true). When I changed the MQTT publisher to stop flagging messages as retained, the problem persisted as long as a previous Mosquitto DB was still around. This was confusing, because I kept restarting my containers and trying new options without seeing any change. At some point I started from scratch, with the publisher not flagging messages as retained and not a single retained message in the store, and things finally started to behave as expected.
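
    To act on this on the publisher side, two things matter: publish new telemetry with the retain flag off, and clear any retained message the broker already stores for a topic, which per MQTT 3.1.1 is done by publishing a zero-length retained payload. A sketch, assuming a paho-mqtt-style client object with a `publish(topic, payload, qos, retain)` method; the topic names in the usage comment are placeholders.

    ```python
    # Helpers for a paho-mqtt-style client (assumed API; any client with a
    # publish(topic, payload, qos, retain) method works the same way).

    def send_telemetry(client, topic, payload):
        """Publish telemetry with the retain flag explicitly off (QoS 1)."""
        client.publish(topic, payload, qos=1, retain=False)

    def clear_retained(client, topic):
        """Delete a retained message: per MQTT 3.1.1, publishing a
        zero-length retained payload removes the one stored for the topic."""
        client.publish(topic, payload=None, qos=1, retain=True)

    # Usage (assumes paho-mqtt is installed and a broker is reachable):
    #   import paho.mqtt.client as mqtt
    #   client = mqtt.Client("telemetry-publisher")   # paho 1.x-style constructor
    #   client.connect("localhost", 1883)
    #   send_telemetry(client, "root/topic/sensor1", b'{"t": 21.5}')
    #   clear_retained(client, "root/topic/sensor1")
    ```

    Alternatively, mosquitto_sub in Mosquitto 2.x has a --remove-retained option that can clear retained messages across a topic filter in bulk.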

    It makes me think there is a bug here, but I am not sure whether this is expected behaviour. To me, the Mosquitto bug, if there is one, is that the queue of undelivered messages keeps growing even when the new messages aren't flagged as retained.

    When things are configured "correctly", Mosquitto can deliver messages fast. I ran a test against a parallel Mosquitto server on an EC2 instance, and 50K+ messages (150 MB) were copied in about 3 minutes. The speed was most likely constrained by the 4G internet connection rather than by a broker limitation.