
Persisting tweets using Spark Streaming


To begin with, our requirement is fairly simple. When the tweets come in, all we need to do is persist them on HDFS (at regular intervals).

The 'checkpoint' API of JavaStreamingContext looked promising, but on closer reading it serves a different purpose: it saves metadata so the streaming computation can recover from failures, rather than exporting the data itself. (Also, I keep getting a '/checkpoint/temp, error: No such file or directory (2)' error, but let's not worry about that for now.)

Question: JavaDStream doesn't have a 'saveAsHadoopFiles' method (that method exists only on JavaPairDStream). Does that mean saving to Hadoop directly from a streaming job is discouraged?
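For what it's worth, a plain JavaDStream can still be written to HDFS via foreachRDD, which hands you each micro-batch as an ordinary RDD. A minimal sketch of the idea, assuming a one-minute batch interval and a hypothetical hdfs:///data/tweets output path (both are illustrative choices, not anything from the original question):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.twitter.TwitterUtils;

import twitter4j.Status;

public class TweetsToHdfs {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("tweets-to-hdfs");
        // One micro-batch per minute = one new set of HDFS files per minute.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(1));

        JavaDStream<Status> tweets = TwitterUtils.createStream(jssc);
        JavaDStream<String> tweetText = tweets.map(Status::getText);

        // Each micro-batch is written as a directory of part files,
        // keyed here by the batch time so paths never collide.
        tweetText.foreachRDD((rdd, time) -> {
            if (!rdd.isEmpty()) {
                rdd.saveAsTextFile("hdfs:///data/tweets/" + time.milliseconds());
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```

Note this produces many small files (one batch of part files per interval), which is exactly the pain point that motivates the Kafka + Camus design discussed below.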

What's the recommended approach? Should I write each incoming tweet to a Kafka queue and then use a tool such as Camus (https://github.com/linkedin/camus) to push the data to HDFS?


Solution

  • Came across this awesome blog entry that confirmed my thinking. The author built a Forex trading system using technologies such as Kafka, Storm, Camus, and Cassandra. That use case is similar to mine, so I am going to use the same design and tools. Thanks.

    http://insightdataengineering.com/blog/Building_a_Forex_trading_platform_using_Kafka_Storm_Cassandra.html
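The Kafka leg of that design amounts to publishing each incoming tweet to a topic; Camus then runs as a periodic MapReduce job that copies the topic's contents to HDFS. A sketch of the producer side using the standard Kafka producer client (the broker address and the topic name "tweets" are assumptions for illustration):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TweetProducer {
    private final KafkaProducer<String, String> producer;

    public TweetProducer(String brokers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers); // e.g. "localhost:9092"
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Call this from the twitter4j StatusListener for every incoming tweet,
    // passing the tweet serialized as JSON.
    public void send(String tweetJson) {
        producer.send(new ProducerRecord<>("tweets", tweetJson));
    }

    public void close() {
        producer.close();
    }
}
```

With this split, the streaming side only ever talks to Kafka, and the batching/small-files concerns move into Camus, which writes appropriately sized HDFS files on its own schedule.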