How does Deduplication work in Apache Pulsar?


I'm trying to use the deduplication feature of Apache Pulsar.

I have set brokerDeduplicationEnabled=true in standalone.conf, but when I send the same message from the producer multiple times, I still receive every copy at the consumer end. Is this expected behaviour?

Doesn't deduplication mean content-based deduplication, as in AWS SQS?

Here is my producer code for reference.

import pulsar
import json

client = pulsar.Client('pulsar://localhost:6650')

# Named producer with the send timeout disabled (send_timeout_millis=0)
producer = client.create_producer(
    'persistent://public/default/my-topic',
    send_timeout_millis=0,
    producer_name="producer-1")

data = {'key1': 0, 'key2': 1}

# Send the same payload ten times; each send() call is a separate message
for i in range(10):
    encoded_data = json.dumps(data).encode('utf-8')
    producer.send(encoded_data)

client.close()

Solution

  • In Pulsar, deduplication doesn't work on the content of the message. It works on the individual message, that is, on each send operation. The intention isn't to deduplicate content but to ensure that an individual message cannot be published more than once.

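    When you send a message, it is given a unique identifier (the producer's name plus a sequence ID that increases with every send). Deduplication ensures that in failure scenarios, such as a retry after a lost acknowledgement, the same message doesn't get stored in (or written to) Pulsar more than once. It does this by comparing the identifier with the identifiers it has already stored. If the identifier of the message has already been stored, Pulsar ignores the message, so it is stored only once. This is part of Pulsar's mechanism for guaranteeing that a message is published exactly once. In your code, each call to producer.send() creates a new message with its own identifier, so ten calls with identical payloads are ten distinct messages, and receiving all ten is the expected behaviour.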

    For more details, see PIP 6: Guaranteed Message Deduplication.
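
To make the distinction concrete, here is a minimal pure-Python sketch of identifier-based deduplication. It is a toy model, not Pulsar's actual implementation: the ToyBroker class and its publish method are hypothetical names, and it assumes the broker only remembers the highest sequence ID stored per producer name. It does not need a running broker.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBroker:
    """Toy model of broker-side deduplication (hypothetical, not Pulsar code)."""
    # Highest sequence ID stored so far, keyed by producer name (assumption).
    last_sequence_id: dict = field(default_factory=dict)
    # Payloads that were actually stored.
    log: list = field(default_factory=list)

    def publish(self, producer_name: str, sequence_id: int, payload: bytes) -> bool:
        """Store the message unless this producer already stored this sequence ID."""
        last = self.last_sequence_id.get(producer_name, -1)
        if sequence_id <= last:
            return False  # duplicate of an already stored message: ignore it
        self.last_sequence_id[producer_name] = sequence_id
        self.log.append(payload)
        return True


broker = ToyBroker()
payload = b'{"key1": 0, "key2": 1}'

# Ten separate sends: identical content, but a fresh sequence ID each time.
results = [broker.publish("producer-1", seq, payload) for seq in range(10)]

# A retry of the last send reuses its sequence ID, so it is dropped.
retry_accepted = broker.publish("producer-1", 9, payload)

print(all(results), retry_accepted, len(broker.log))  # True False 10
```

In this model the payload is never inspected, which is why identical content sent ten times is stored ten times, while a resend of one particular message is ignored.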