scala · apache-spark · spark-streaming · filestream

Spark Streaming wordcount using fileStream doesn't print results


I'm using a file as a Spark Streaming source and want to count the words in the stream, but the application prints nothing. I'm using Scala on a Cloudera environment. Here's my code:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TwitterHashtagStreaming {

      def main(args: Array[String]): Unit = {

        val conf = new SparkConf()
          .setAppName("TwitterHashtagStreaming")
          .setMaster("local[2]")
          .set("spark.executor.memory", "1g")

        val streamingC = new StreamingContext(conf, Seconds(5))

        val streamLines = streamingC.textFileStream("file:///home/cloudera/Desktop/wordstream")
        val words = streamLines.flatMap(_.split(" "))
        val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

        counts.print()

        streamingC.start()
        streamingC.awaitTermination()
      }
    }

Solution

  • Please refer carefully to the documentation:

    def textFileStream(directory: String): DStream[String]
    

    Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using key as LongWritable, value as Text and input format as TextInputFormat). Files must be written to the monitored directory by "moving" them from another location within the same file system. File names starting with . are ignored.

    In a word, it is a change detector: you must start your streaming service first, and only then move your data into the monitored directory. Files that already exist in the directory when the stream starts are ignored, which is why your application prints nothing.

    These semantics simulate the "stream" concept as it behaves in a real production deployment: data arrives gradually over time, the way network packets do, rather than all being present up front like your files.
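    In practice that means: start the streaming application first, then deliver each batch by writing the file somewhere else on the same filesystem and moving it into the monitored directory (on one filesystem, mv is an atomic rename, which is exactly the "moving" the docs require). A minimal sketch of that delivery step, with temp directories standing in for your real /home/cloudera/Desktop/wordstream path:

    ```shell
    # Stand-ins for the real directories; in your setup, "monitored" would be
    # /home/cloudera/Desktop/wordstream and "staging" any dir on the same filesystem.
    monitored=$(mktemp -d)
    staging=$(mktemp -d)

    # 1. Write the batch file OUTSIDE the monitored directory first,
    #    so the stream can never observe a half-written file.
    echo "hello spark hello streaming" > "$staging/batch-1.txt"

    # 2. Then mv it into place: an atomic rename that textFileStream
    #    detects as a newly appeared file on the next batch interval.
    mv "$staging/batch-1.txt" "$monitored/batch-1.txt"

    cat "$monitored/batch-1.txt"
    ```

    Run this only after streamingC.start() has been called; within at most one 5-second batch interval, counts.print() should then show the word counts for the new file.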