scala twitter apache-spark twitter4j spark-streaming

Spark Streaming - Twitter - Filtering tweet data


I am new to Scala and Spark. I am working on Spark Streaming with Twitter data. I flatMapped the stream into individual words. Now I need to remove words that start with # or @, and words like RT, from the streaming data before processing them. I know this should be easy. I wrote a filter for it, but it is not working. Can anyone help with this? My code is:

    val sparkConf = new SparkConf().setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    val stream = TwitterUtils.createStream(ssc, None)
    //val lanFilter = stream.filter(status => status.getLang == "en")
    val RDD1 = stream.flatMap(status => status.getText.split(" "))
    val filterRDD = RDD1.filter(word =>(word !=word.startsWith("#")))
    filterRDD.print()

Also, the language filter (the commented-out line) is showing an error.

Thank you.


Solution

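  • Is your lambda expression correct? The condition word != word.startsWith("#") compares a String with a Boolean, which is always true, so nothing gets filtered out. I think you want: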

    val filterRDD = RDD1.filter(word => !word.startsWith("#"))
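
    The question also asks about words starting with @ and the word RT. A minimal sketch of one way to cover all three cases with a single predicate is below. It uses a plain Scala array instead of a live stream so it can be run without Spark or Twitter credentials. The names TweetWordFilter, keepWord and sampleTweet are made up for illustration, and dropping empty strings is an extra assumption that was not in the original question.

```scala
object TweetWordFilter {
  // Returns true for words worth keeping: non-empty, not a hashtag (#...),
  // not a mention (@...), and not the retweet marker "RT".
  // Dropping empty strings is an assumption added for this sketch.
  def keepWord(word: String): Boolean =
    word.nonEmpty &&
      !word.startsWith("#") &&
      !word.startsWith("@") &&
      word != "RT"

  def main(args: Array[String]): Unit = {
    // Hypothetical tweet text, used only to exercise the predicate.
    val sampleTweet = "RT @someone: learning #spark streaming today"
    val kept = sampleTweet.split(" ").filter(keepWord)
    println(kept.mkString(" ")) // prints "learning streaming today"
  }
}
```

    In the streaming job the same predicate should be usable as RDD1.filter(TweetWordFilter.keepWord), since filter on a DStream takes a function from the element type to Boolean, just like filter on a Scala collection.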