apache-spark apache-kafka spark-streaming twitter-streaming-api spark-streaming-kafka

What are the drawbacks of Spark-Kafka integration on a local machine for real-time Twitter streaming analysis?


I am using Spark-Kafka integration for my project, which finds the top trending hashtags on Twitter. I use Kafka to push tweets from the tweepy streaming API, and on the consumer side I use Spark Streaming for DStream and RDD transformations.

My question is whether running the streaming process through Kafka for some time may lead to storage issues, since I am running both the producer and the consumer on my local machine. How long can I safely run the producer (I need it to run for some time to get the right trending counts)?

Also, would it be better to run it on a cloud platform such as AWS?


Solution

  • It's not clear what time window you're using, or where Kafka is running. Calculating trends over a window of 10 minutes to an hour or so shouldn't take up much disk at all on the Spark side.

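    Kafka storage will of course need to be large enough for your use case. The broker keeps messages on disk until the topic's retention limit is reached (7 days by default), so disk usage is roughly bounded by ingest rate times retention, and you can lower the retention if space is tight.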

    Tweets are not very large. Filtering them down to just the hashtags only makes the messages smaller.

    Note: Spark seems like overkill for this, as you could do the same with Kafka Connect for ingest and ksqlDB for computation.
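
To get a feel for how much disk the Kafka topic might need, here is a rough back-of-the-envelope sketch in Python. The function names `extract_hashtags` and `estimate_disk_bytes` are made up for illustration, and the ingest rate, message sizes, and retention period are assumptions rather than measurements, so plug in your own numbers. It also ignores compression, replication, and index overhead.

```python
import re

# Simple pattern for hashtags; real tweets may need a more careful rule.
HASHTAG_RE = re.compile(r"#\w+")


def extract_hashtags(text):
    """Return the lowercased hashtags found in a tweet's text."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def estimate_disk_bytes(msgs_per_sec, avg_msg_bytes, retention_hours):
    """Rough upper bound on topic size: everything produced within
    the retention window (uncompressed, single replica)."""
    return msgs_per_sec * avg_msg_bytes * retention_hours * 3600


# Assumed figures, for illustration only.
RATE = 50             # tweets per second reaching the producer
FULL_TWEET_BYTES = 3000   # a full tweet JSON payload
HASHTAG_BYTES = 40        # hashtags only
RETENTION_HOURS = 24      # retention you would configure on the topic

full = estimate_disk_bytes(RATE, FULL_TWEET_BYTES, RETENTION_HOURS)
slim = estimate_disk_bytes(RATE, HASHTAG_BYTES, RETENTION_HOURS)

print(extract_hashtags("Loving #Spark and #kafka!"))
print(f"full tweets:   {full / 1e9:.2f} GB")
print(f"hashtags only: {slim / 1e9:.2f} GB")
```

The point is that the topic does not grow forever: once the retention limit is hit, old segments are deleted, so the producer can run indefinitely as long as the steady-state size fits on your disk. Sending only the hashtags from the producer shrinks that size considerably.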