apache-kafka twitter

Extract particular data from Kafka topic


I'm streaming Twitter data in real time and wonder whether there is a way to extract only the messages and certain values from a Kafka topic.


Solution

  • You can use ksqlDB to do this. For example:

    ksql> CREATE STREAM TWEETS WITH (KAFKA_TOPIC='twitter_01', VALUE_FORMAT='Avro');
    
    ksql> SELECT USER->SCREENNAME, TEXT FROM TWEETS WHERE TEXT LIKE '%cool%' EMIT CHANGES;
    
    +-------------------+------------------------------------------------------------------------------------------+
    |USER__SCREENNAME   |TEXT                                                                                      |
    +-------------------+------------------------------------------------------------------------------------------+
    |MobileGist         |This is super cool!! Great work @houchens_kim!                                            |
    

    You can also write the results to a new Kafka topic by creating a stream from the query:

    ksql> CREATE STREAM COOL_TWEETS AS SELECT USER->SCREENNAME, TEXT FROM TWEETS WHERE TEXT LIKE '%cool%' EMIT CHANGES;
    

    Since you tagged Python, it's worth pointing out that you can call ksqlDB through its REST API from Python.

    Ref: Exploring ksqlDB with Twitter Data
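
As a rough illustration of the REST API route mentioned above, here is a minimal Python sketch that sends the same `SELECT` to a ksqlDB server and yields matching rows. It uses only the standard library. The server address `http://localhost:8088`, the helper names (`build_query_payload`, `parse_row_line`, `stream_matching_tweets`), and the line-by-line framing of the `/query` response are assumptions for illustration; the response format can differ between ksqlDB versions, so check it against your server before relying on this.

```python
import json
import urllib.request

# Assumed address of a local ksqlDB server; adjust for your setup.
KSQLDB_URL = "http://localhost:8088"


def build_query_payload(keyword):
    """Build the JSON body for a push query filtering tweets by keyword."""
    # ksqlDB string literals use single quotes; double any quote in the keyword.
    safe = keyword.replace("'", "''")
    statement = (
        "SELECT USER->SCREENNAME, TEXT FROM TWEETS "
        f"WHERE TEXT LIKE '%{safe}%' EMIT CHANGES;"
    )
    return {"ksql": statement, "streamsProperties": {}}


def parse_row_line(raw_line):
    """Return the column values from one streamed line, or None if it is not a row.

    Assumes the server streams a JSON array with one object per line,
    e.g. {"row": {"columns": [...]}}, followed by a trailing comma.
    """
    text = raw_line.strip()
    if text.startswith("["):
        text = text[1:]
    text = text.rstrip(",")
    if text in ("", "]"):
        return None
    try:
        message = json.loads(text)
    except json.JSONDecodeError:
        return None
    row = message.get("row") if isinstance(message, dict) else None
    return row["columns"] if row else None


def stream_matching_tweets(keyword, server_url=KSQLDB_URL):
    """Yield [screen_name, text] lists for tweets whose text contains keyword."""
    body = json.dumps(build_query_payload(keyword)).encode("utf-8")
    request = urllib.request.Request(
        f"{server_url}/query",
        data=body,
        headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
        method="POST",
    )
    # The push query keeps the connection open, so read it line by line.
    with urllib.request.urlopen(request) as response:
        for raw_line in response:
            columns = parse_row_line(raw_line.decode("utf-8"))
            if columns is not None:
                yield columns
```

With a server running and the `TWEETS` stream already created, `for screen_name, text in stream_matching_tweets("cool"): print(screen_name, text)` would print matches as they arrive. Because this is a push query, the loop runs until the connection is closed.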