I'm using a file as a Spark Streaming source, and I want to count the words in the stream, but the application prints nothing. Here's my code. I'm using Scala on a Cloudera environment.
import org.apache.spark.SparkConf
import org.apache.spark.streaming._

object TwitterHashtagStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("TwitterHashtagStreaming")
      .setMaster("local[2]")
      .set("spark.executor.memory", "1g")
    val streamingC = new StreamingContext(conf, Seconds(5))
    val streamLines = streamingC.textFileStream("file:///home/cloudera/Desktop/wordstream")
    val words = streamLines.flatMap(_.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()
    streamingC.start()
    streamingC.awaitTermination()
  }
}
Please refer carefully to the documentation for textFileStream:
def textFileStream(directory: String): DStream[String]
Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using key as LongWritable, value as Text and input format as TextInputFormat). Files must be written to the monitored directory by "moving" them from another location within the same file system. File names starting with . are ignored.
In a word, it is a change detector: you must start your streaming application first, and only then move your data files into the monitored directory.
These semantics simulate the "stream" concept as it behaves in a production deployment, where, for example, network packets arrive gradually over time, just like your files.
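A minimal sketch of that workflow from the shell (the paths here are examples; the monitored directory must match whatever you passed to textFileStream, e.g. file:///home/cloudera/Desktop/wordstream in your code):

```shell
# Stand-in for the directory passed to textFileStream (example path)
DIR=/tmp/wordstream
mkdir -p "$DIR"

# ... start the Spark Streaming application now, with $DIR empty ...

# While the application is running, write the data somewhere else on the
# SAME filesystem, then move it in. The move is atomic, so the file
# appears in the monitored directory fully written, and Spark picks it
# up as a NEW file in the next micro-batch.
echo "hello spark hello streaming" > /tmp/words.txt
mv /tmp/words.txt "$DIR/words.txt"
```

Files that already exist in the directory when the application starts are not processed; only files that appear afterwards count as new, which is why your job prints only empty batches if the data was in place beforehand.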