apache-kafka, ksqldb

Why does ksqlDB consume all messages from Kafka even though I added LIMIT 1 to the query?


I ran this query:

select * from USER_EVENTS emit changes limit 1;

USER_EVENTS is a stream. Before running it I set auto.offset.reset to earliest. The query runs slowly and I don't know why. I then ran SHOW QUERIES to find the consumer group id of the query and searched for it in Kafka. It turns out the query fetches every message in the topic, even though I only need one row. Is that true, and why does it fetch everything? I would have thought fetching one message is enough, since I added LIMIT 1 to the query. The topic behind USER_EVENTS has about 1 million messages. I use ksqlDB server 6.1.0 and the same version of the ksqlDB CLI.


Solution

  • This is what ksqlDB is supposed to do: consume the entire stream and materialize a result from it. Your query even says

    emit changes
    

    which means it will go through your messages one by one and update the result in near real time. LIMIT 1 only means that it will show a single row (and keep updating it) instead of a growing table; the stream is consumed either way.

    The alternative would be

    emit final
    

    which would only show the final result, but it would still go through the entire stream.

    Stopping after a single message is, at least to my knowledge, not possible with ksqlDB.


    If you just need to look at one message interactively, I recommend using a CLI tool like kcat or https://github.com/birdayz/kaf, both of which have an option to consume only a single message.
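    For example, with kcat (the broker address `localhost:9092` and the topic name `USER_EVENTS` below are placeholders; the stream's backing topic may be named differently):

    ```shell
    # -C consume mode, -b broker, -t topic,
    # -o beginning: start at the earliest offset,
    # -c 1: stop after one message, -e: exit when the end of the partition is reached
    kcat -C -b localhost:9092 -t USER_EVENTS -o beginning -c 1 -e
    ```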


    If you need it programmatically, I would probably write a small consumer by hand and simply call poll() once instead of running the standard poll loop.
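    A minimal sketch of that idea in Java, using the plain Kafka consumer API (the broker address, group id, topic name, and String deserializers are all assumptions for illustration; the real topic may use Avro or JSON serialization):

    ```java
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class SingleMessageConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "single-message-reader");   // placeholder group id
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1); // return at most one record per poll
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("USER_EVENTS"));
                // A single poll instead of the usual while-loop. The first poll can
                // return empty while the group is still rebalancing, so retry briefly.
                for (int attempt = 0; attempt < 10; attempt++) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    if (!records.isEmpty()) {
                        ConsumerRecord<String, String> record = records.iterator().next();
                        System.out.println(record.value());
                        break;
                    }
                }
            }
        }
    }
    ```

    Setting max.poll.records to 1 keeps the client from prefetching a large batch it will never use.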


    If you want a "hacky" quick fix, you could also try to set

    SET 'auto.offset.reset'='latest';
    

    for your query in ksqlDB. The query still treats the stream as unbounded, but it starts at the newest available offset, so it ignores everything already in the topic.
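    Put together, a ksqlDB CLI session for that quick fix might look like this (the stream name is taken from the question; with `latest`, the query will sit idle until a new message arrives):

    ```sql
    SET 'auto.offset.reset'='latest';
    SELECT * FROM USER_EVENTS EMIT CHANGES LIMIT 1;
    ```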