I have a project using Apache Samza, and I have a problem with duplicate data.
This is my checkpoint configuration:
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
task.checkpoint.replication.factor=2
task.commit.ms=20000
In the documentation we can read the following:
If task.checkpoint.factory is configured, this property determines how often a checkpoint is written. The value is the time between checkpoints, in milliseconds. The frequency of checkpointing affects failure recovery: if a container fails unexpectedly (e.g. due to crash or machine failure) and is restarted, it resumes processing at the last checkpoint. Any messages processed since the last checkpoint on the failed container are processed again. Checkpointing more frequently reduces the number of messages that may be processed twice, but also uses more resources.
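To make the trade-off concrete (the throughput figure here is a hypothetical assumption, not a measurement from my job): if one task processes roughly 1,000 messages per second, then a crash just before a checkpoint means up to 20 s × 1,000 msg/s = 20,000 messages get reprocessed with task.commit.ms=20000, but only about 0.25 s × 1,000 msg/s = 250 messages with task.commit.ms=250.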
So can I change task.commit.ms=20000 to 250 ms, or even 1 ms? Is that fine, or a very bad idea? The cluster itself has plenty of spare capacity.
Why do I need to change this? Because a Samza worker crashes 1-3 times each week, and for now the temporary workaround is to commit the offset after every message.
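To be clear about what I mean by committing after every message, here is a minimal sketch using Samza's low-level StreamTask API (the class name and the omitted processing logic are mine for illustration; the relevant part is the coordinator.commit call):

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class CommitPerMessageTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // ... process the message here ...

        // Request a checkpoint for this task right away instead of waiting
        // for the task.commit.ms timer. This shrinks the duplicate window
        // to the single in-flight message, at the cost of one checkpoint
        // write per message.
        coordinator.commit(TaskCoordinator.RequestScope.CURRENT_TASK);
    }
}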
Documentation ref:
My solution (I know it doesn't fix every problem) is to change task.commit.ms to the same value as task.shutdown.ms=5000.
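For reference, the configuration with that change applied would look like this (the checkpoint settings are unchanged from above; only the commit interval moves to 5000 ms, and I state task.shutdown.ms explicitly just for clarity):

task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
task.checkpoint.replication.factor=2
task.commit.ms=5000
task.shutdown.ms=5000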