docker, apache-kafka, docker-compose, apache-kafka-streams

How is it possible that data in Kafka survives container recycling?


First, I do not know whether this issue is with Kafka or with Docker; I am a rookie with both. But I assume it is more a Docker than a Kafka problem (in fact the real problem is probably my limited understanding of one or the other …).

I installed Docker on a Raspberry Pi 4 and created Docker images for Kafka and for Zookeeper; I had to build them myself because 64-bit Raspberry Pi was not supported by any of the existing images (at least I could not find any). But I got them working.

Next I implemented the Kafka Streams example (WordCount) from the Kafka documentation; it runs fine, counting the words in all the texts you push into it and keeping the numbers from all previous runs. That is expected; at least it is described that way in the documentation.

So after some test runs I wanted to reset the whole thing.

I thought the easiest way to get there was to shut down the Docker containers, delete the mounted folders on the host, and start over.

But that does not work: the word counters are still there! Meaning the word count did not start from 0 …

Ok, next attempt: not only removing the containers, but rebuilding the images, too, for both Zookeeper and Kafka, of course!

No difference! The word counts from all the previous runs were retained.

Using docker system prune --volumes made no difference either …

From my limited understanding of Docker, I assumed that any runtime data is stored in the container or in the mounted folders (volumes). So when I delete the containers and the folders on the Docker host that were mounted into the containers, I expect that any state would be gone.

Obviously not … so I missed something important here, most probably with Docker.

The docker-compose file I used:

version: '3'

services:
  zookeeper:
    image: tquadrat/zookeeper:latest
    ports:
      - "2181:2181"
      - "2888:2888"
      - "3888:3888"
      - "8080:8080"
    volumes:
      - /data/zookeeper/config:/config
      - /data/zookeeper/data:/data
      - /data/zookeeper/datalog:/datalog
      - /data/zookeeper/logs:/logs
    environment:
      ZOO_SERVERS: "server.1=zookeeper:2888:3888;2181"
    restart: always

  kafka:
    image: tquadrat/kafka:latest
    depends_on:
      - zookeeper
    ports:
      - "9091:9091"
    volumes:
      - /data/kafka/config:/config
      - /data/kafka/logs:/logs
    environment:
      KAFKA_LISTENERS: "INTERNAL://kafka:29091,EXTERNAL://:9091"
      KAFKA_ADVERTISED_LISTENERS: "INTERNAL://kafka:29091,EXTERNAL://TCON-PI4003:9091"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT"
      KAFKA_INTER_BROKER_LISTENER_NAME: "INTERNAL"
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_DELETE_TOPIC_ENABLE: "true"
    restart: always

The script file I used to clear out the mounted folders:

#!/bin/sh

set -eux

DATA="/data"
KAFKA_DATA="$DATA/kafka"
ZOOKEEPER_DATA="$DATA/zookeeper"

# -f: do not abort (set -e) when a folder is already missing
sudo rm -rf "$KAFKA_DATA"
sudo rm -rf "$ZOOKEEPER_DATA"

mkdir -p "$KAFKA_DATA/config" "$KAFKA_DATA/logs"
mkdir -p "$ZOOKEEPER_DATA/config" "$ZOOKEEPER_DATA/data" "$ZOOKEEPER_DATA/datalog" "$ZOOKEEPER_DATA/logs"

Any ideas?


Solution

  • Kafka Streams stores its own local state under the "state.dir" config on the host machine the Streams application is running on, not inside the Kafka broker containers. With the Apache Kafka defaults this is under /tmp (typically /tmp/kafka-streams). First check whether you have overridden that property in your code. Since your Streams application runs outside the containers, recycling the containers and images never touches that directory, which is why the word counts survive.

    As far as Docker goes, try without volumes first.

    Using docker system prune --volumes made no difference also …

    That would only remove unused named volumes (those created with docker volume create or a top-level volumes: section in Compose), not bind-mounted host directories like /data/kafka or /data/zookeeper.
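If state.dir was never overridden, a minimal sketch of locating and clearing that local state on the machine running the Streams application (assuming the stock /tmp/kafka-streams default; adjust the path and application id to your own setup):

```shell
#!/bin/sh
# Local Kafka Streams state lives on the machine that runs the Streams
# application, not in the broker containers. Default location if
# state.dir was never overridden (adjust to your own configuration):
STATE_DIR="${STATE_DIR:-/tmp/kafka-streams}"

echo "Streams state directory: $STATE_DIR"

# Stop the Streams application first, then remove its local state:
# rm -rf "$STATE_DIR"
#
# Kafka also ships a reset tool that additionally clears the
# application's internal topics on the broker:
# bin/kafka-streams-application-reset.sh --application-id <your-app-id>
```

Deleting the directory (or using the reset tool) while the application is stopped makes the next run start counting from 0 again.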