Tags: java, apache-kafka, apache-kafka-streams, long-integer, word-count

Kafka Streams Twitter Wordcount - Count Value not Long after Serialization


I am running a Kafka cluster with Docker Compose on an AWS EC2 instance. I want to receive all tweets containing a specific keyword and push them to Kafka, which works fine. I also want to count the most frequently used words in those tweets.

This is the WordCount code:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.StreamsBuilder;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import java.util.concurrent.CountDownLatch;

import static org.apache.kafka.streams.StreamsConfig.APPLICATION_ID_CONFIG;
import static org.apache.kafka.streams.StreamsConfig.BOOTSTRAP_SERVERS_CONFIG;
import static org.apache.kafka.streams.StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG;
import static org.apache.kafka.streams.StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG;

public class WordCount {

    public static void main(String[] args) {

        final StreamsBuilder builder = new StreamsBuilder();

        final KStream<String, String> textLines = builder
                .stream("test-topic");

        textLines
                .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
                .groupBy((key, value) -> value)
                .count(Materialized.as("WordCount"))
                .toStream()
                .to("test-output", Produced.with(Serdes.String(), Serdes.Long()));

        final Topology topology = builder.build();

        Properties props = new Properties();
        props.put(APPLICATION_ID_CONFIG, "streams-word-count");
        props.put(BOOTSTRAP_SERVERS_CONFIG, "ec2-ip:9092");
        props.put(DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        final KafkaStreams streams = new KafkaStreams(topology, props);

        final CountDownLatch latch = new CountDownLatch(1);
        Runtime.getRuntime().addShutdownHook(
                new Thread("streams-shutdown-hook") {
                    @Override
                    public void run() {
                        streams.close();
                        latch.countDown();
                    }
                });
        try {
            streams.start();
            latch.await();
        } catch (Throwable e) {
            System.exit(1);
        }
        System.exit(0);
    }
}
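The tokenizing step inside flatMapValues can be tried on its own, without a broker. The following is a minimal sketch in plain Java; the class name TokenizeDemo and the sample strings are made up for illustration. It applies the same toLowerCase().split("\\W+") expression as the topology above.

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeDemo {

    // Same expression as in the flatMapValues step of the topology.
    public static List<String> tokenize(String value) {
        return Arrays.asList(value.toLowerCase().split("\\W+"));
    }

    public static void main(String[] args) {
        // Plain sentence: punctuation and spaces act as separators.
        System.out.println(tokenize("Hello, Kafka! hello"));

        // Tweet-like text starting with a non-word character:
        // the first token is an empty string.
        System.out.println(tokenize("@user hi"));
    }
}
```

Note that text starting with a non-word character such as @ or # produces an empty first token, so an empty key may show up in the counts unless it is filtered out.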

When I check the output topic in the Control Center, it looks like this:

[Control Center topic view with Key and Value columns; contents not preserved]

Splitting the tweets into single words appears to work. But the count value is not displayed as a Long, although that is what the code specifies.

When I use the kafka-console-consumer to consume from this topic, it says:

"Size of data received by LongDeserializer is not 8"


Solution

  • By default, the Control Center UI and the console consumer can only render UTF-8 data.

    You'll need to explicitly pass LongDeserializer to the console consumer, as the value deserializer only
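
To illustrate why the counts look garbled as text, and where the "is not 8" message can come from, here is a minimal plain-Java sketch. It assumes the count is written as 8 big-endian bytes, which is how Kafka's LongSerializer encodes a Long. The class LongBytesDemo and its methods are hypothetical stand-ins for illustration, not Kafka's own classes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class LongBytesDemo {

    // Stand-in for a Long serializer: always 8 bytes, big-endian.
    public static byte[] serializeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Stand-in for a Long deserializer: rejects anything that is not 8 bytes.
    public static long deserializeLong(byte[] data) {
        if (data.length != Long.BYTES) {
            throw new IllegalArgumentException(
                    "Size of data received is not 8 (got " + data.length + ")");
        }
        return ByteBuffer.wrap(data).getLong();
    }

    public static void main(String[] args) {
        // The count value as it would sit in the output topic.
        byte[] countBytes = serializeLong(42L);
        System.out.println("decoded as long: " + deserializeLong(countBytes));

        // Rendering the same bytes as UTF-8 text gives mostly control characters.
        String asText = new String(countBytes, StandardCharsets.UTF_8);
        System.out.println("decoded as UTF-8: " + asText.length() + " characters");

        // A word key is a UTF-8 string, so it is usually not 8 bytes long.
        byte[] keyBytes = "kafka".getBytes(StandardCharsets.UTF_8);
        try {
            deserializeLong(keyBytes);
        } catch (IllegalArgumentException e) {
            System.out.println("key rejected: " + e.getMessage());
        }
    }
}
```

The last case shows why the Long deserializer should be applied to the value only: the keys are words, and their byte length is rarely 8. With the console consumer this would typically mean setting only the value side, for example with --value-deserializer org.apache.kafka.common.serialization.LongDeserializer, and leaving the key deserializer at its string default.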