Search code examples
javaapache-sparkstreamingrecord

Spark Streaming does not display any record on application UI


I am new to spark and I am trying to run a simple spark streaming application that reads data from a csv file and displays it. Seems like spark streaming works but it still shows "0" records on the Streaming UI application.Here is my code:

public class App {
  public static void main(String[] args) throws Exception {
    // Get an instance of spark-conf, required to build the spark session
    SparkConf conf = new SparkConf().setAppName("StreamingExample").setMaster("local");
    JavaStreamingContext jsc = new JavaStreamingContext(conf, new Duration(3000));
    //JavaSparkContext ssc= new JavaSparkContext(conf);
    jsc.checkpoint("checkpoint");

    System.out.println("Session created");

    JavaDStream < String > lines = jsc.textFileStream("C:\\Users\\Areeha\\eclipse-workspace\\learnspark\\src\\main\\java\\com\\example\\learnspark");
    lines.print();
    lines.foreachRDD(rdd - > rdd.foreach(x - > System.out.println(x)));

    JavaPairDStream < LongWritable, Text > streamedFile = jsc.fileStream("C:\\Users\\Areeha\\eclipse-workspace\\learnspark\\src\\main\\java\\com\\example\\learnspark", LongWritable.class, Text.class, TextInputFormat.class);
    streamedFile.print();
    System.out.println("File loaded!");
    System.out.println(streamedFile.count());
    System.out.println(lines.count());

    jsc.start();
    try {
      jsc.awaitTermination();
    } catch (InterruptedException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }


  }
}

This is what I get on console:

Using Spark 's default log4j profile: org/apache/spark/log4j-defaults.properties
19 / 11 / 21 09: 24: 50 INFO SparkContext: Running Spark version 2.4 .4
19 / 11 / 21 09: 24: 50 WARN NativeCodeLoader: Unable to load native - hadoop library
for your platform...using builtin - java classes where applicable
19 / 11 / 21 09: 24: 50 INFO SparkContext: Submitted application: StreamingExample
19 / 11 / 21 09: 24: 50 INFO SecurityManager: Changing view acls to: Areeha
19 / 11 / 21 09: 24: 50 INFO SecurityManager: Changing modify acls to: Areeha
19 / 11 / 21 09: 24: 50 INFO SecurityManager: Changing view acls groups to:
  19 / 11 / 21 09: 24: 50 INFO SecurityManager: Changing modify acls groups to:
  19 / 11 / 21 09: 24: 50 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled;
users with view permissions: Set(Areeha);
groups with view permissions: Set();
users with modify permissions: Set(Areeha);
groups with modify permissions: Set()
19 / 11 / 21 09: 24: 51 INFO Utils: Successfully started service 'sparkDriver'
on port 57635.
19 / 11 / 21 09: 24: 51 INFO SparkEnv: Registering MapOutputTracker
19 / 11 / 21 09: 24: 51 INFO SparkEnv: Registering BlockManagerMaster
19 / 11 / 21 09: 24: 51 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper
for getting topology information
19 / 11 / 21 09: 24: 51 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19 / 11 / 21 09: 24: 51 INFO DiskBlockManager: Created local directory at C: \Users\ Areeha\ AppData\ Local\ Temp\ blockmgr - 9 d8ba7c2 - 3 b21 - 419 c - 8711 - d85f7d1704a1
19 / 11 / 21 09: 24: 51 INFO MemoryStore: MemoryStore started with capacity 1443.6 MB
19 / 11 / 21 09: 24: 51 INFO SparkEnv: Registering OutputCommitCoordinator
19 / 11 / 21 09: 24: 52 INFO Utils: Successfully started service 'SparkUI'
on port 4040.
19 / 11 / 21 09: 24: 52 INFO SparkUI: Bound SparkUI to 0.0 .0 .0, and started at http: //192.168.2.8:4040
  19 / 11 / 21 09: 24: 52 INFO Executor: Starting executor ID driver on host localhost
19 / 11 / 21 09: 24: 52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService'
on port 57648.
19 / 11 / 21 09: 24: 52 INFO NettyBlockTransferService: Server created on 192.168 .2 .8: 57648
19 / 11 / 21 09: 24: 52 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy
for block replication policy
19 / 11 / 21 09: 24: 52 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168 .2 .8, 57648, None)
19 / 11 / 21 09: 24: 52 INFO BlockManagerMasterEndpoint: Registering block manager 192.168 .2 .8: 57648 with 1443.6 MB RAM, BlockManagerId(driver, 192.168 .2 .8, 57648, None)
19 / 11 / 21 09: 24: 52 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168 .2 .8, 57648, None)
19 / 11 / 21 09: 24: 52 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168 .2 .8, 57648, None)
19 / 11 / 21 09: 24: 52 WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode
if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
Session created
19 / 11 / 21 09: 24: 52 INFO FileInputDStream: Duration
for remembering RDDs set to 60000 ms
for org.apache.spark.streaming.dstream.FileInputDStream @14151bc5
19 / 11 / 21 09: 24: 52 INFO FileInputDStream: Duration
for remembering RDDs set to 60000 ms
for org.apache.spark.streaming.dstream.FileInputDStream @151335cb
File loaded!
  org.apache.spark.streaming.api.java.JavaDStream @46d8f407
org.apache.spark.streaming.api.java.JavaDStream @2788d0fe
19 / 11 / 21 09: 24: 53 INFO FileBasedWriteAheadLog_ReceivedBlockTracker: Recovered 4 write ahead log files from file: /C:/Users / Areeha / eclipse - workspace / learnspark / checkpoint / receivedBlockMetadata
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Slide time = 3000 ms
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Storage level = Serialized 1 x Replicated
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Checkpoint interval = null
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Remember interval = 60000 ms
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Initialized and validated org.apache.spark.streaming.dstream.FileInputDStream @14151bc5
19 / 11 / 21 09: 24: 53 INFO MappedDStream: Slide time = 3000 ms
19 / 11 / 21 09: 24: 53 INFO MappedDStream: Storage level = Serialized 1 x Replicated
19 / 11 / 21 09: 24: 53 INFO MappedDStream: Checkpoint interval = null
19 / 11 / 21 09: 24: 53 INFO MappedDStream: Remember interval = 3000 ms
19 / 11 / 21 09: 24: 53 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream @528f8f8b
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Slide time = 3000 ms
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Storage level = Serialized 1 x Replicated
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Checkpoint interval = null
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Remember interval = 3000 ms
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream @4cbf4f53
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Slide time = 3000 ms
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Storage level = Serialized 1 x Replicated
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Checkpoint interval = null
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Remember interval = 60000 ms
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Initialized and validated org.apache.spark.streaming.dstream.FileInputDStream @14151bc5
19 / 11 / 21 09: 24: 53 INFO MappedDStream: Slide time = 3000 ms
19 / 11 / 21 09: 24: 53 INFO MappedDStream: Storage level = Serialized 1 x Replicated
19 / 11 / 21 09: 24: 53 INFO MappedDStream: Checkpoint interval = null
19 / 11 / 21 09: 24: 53 INFO MappedDStream: Remember interval = 3000 ms
19 / 11 / 21 09: 24: 53 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream @528f8f8b
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Slide time = 3000 ms
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Storage level = Serialized 1 x Replicated
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Checkpoint interval = null
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Remember interval = 3000 ms
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream @58d63b16
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Slide time = 3000 ms
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Storage level = Serialized 1 x Replicated
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Checkpoint interval = null
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Remember interval = 60000 ms
19 / 11 / 21 09: 24: 53 INFO FileInputDStream: Initialized and validated org.apache.spark.streaming.dstream.FileInputDStream @151335cb
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Slide time = 3000 ms
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Storage level = Serialized 1 x Replicated
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Checkpoint interval = null
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Remember interval = 3000 ms
19 / 11 / 21 09: 24: 53 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream @748e9b20
19 / 11 / 21 09: 24: 53 INFO RecurringTimer: Started timer
for JobGenerator at time 1574349894000
19 / 11 / 21 09: 24: 53 INFO JobGenerator: Started JobGenerator at 1574349894000 ms
19 / 11 / 21 09: 24: 53 INFO JobScheduler: Started JobScheduler
19 / 11 / 21 09: 24: 53 INFO StreamingContext: StreamingContext started
19 / 11 / 21 09: 24: 54 INFO FileInputDStream: Finding new files took 9 ms
19 / 11 / 21 09: 24: 54 INFO FileInputDStream: New files at time 1574349894000 ms:

  19 / 11 / 21 09: 24: 54 INFO FileInputDStream: Finding new files took 3 ms
19 / 11 / 21 09: 24: 54 INFO FileInputDStream: New files at time 1574349894000 ms:

  19 / 11 / 21 09: 24: 54 INFO JobScheduler: Added jobs
for time 1574349894000 ms
19 / 11 / 21 09: 24: 54 INFO JobGenerator: Checkpointing graph
for time 1574349894000 ms
19 / 11 / 21 09: 24: 54 INFO DStreamGraph: Updating checkpoint data
for time 1574349894000 ms
19 / 11 / 21 09: 24: 54 INFO JobScheduler: Starting job streaming job 1574349894000 ms .0 from job set of time 1574349894000 ms
19 / 11 / 21 09: 24: 54 INFO DStreamGraph: Updated checkpoint data
for time 1574349894000 ms
19 / 11 / 21 09: 24: 54 INFO CheckpointWriter: Submitted checkpoint of time 1574349894000 ms to writer queue
19 / 11 / 21 09: 24: 54 INFO CheckpointWriter: Saving checkpoint
for time 1574349894000 ms to file 'file:/C:/Users/Areeha/eclipse-workspace/learnspark/checkpoint/checkpoint-1574349894000'
  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -
  Time: 1574349894000 ms
  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -

  19 / 11 / 21 09: 24: 54 INFO JobScheduler: Finished job streaming job 1574349894000 ms .0 from job set of time 1574349894000 ms
19 / 11 / 21 09: 24: 54 INFO JobScheduler: Starting job streaming job 1574349894000 ms .1 from job set of time 1574349894000 ms
19 / 11 / 21 09: 24: 54 INFO SparkContext: Starting job: foreach at App.java: 79
19 / 11 / 21 09: 24: 54 INFO DAGScheduler: Job 0 finished: foreach at App.java: 79, took 0.002286 s
19 / 11 / 21 09: 24: 54 INFO JobScheduler: Finished job streaming job 1574349894000 ms .1 from job set of time 1574349894000 ms
19 / 11 / 21 09: 24: 54 INFO JobScheduler: Starting job streaming job 1574349894000 ms .2 from job set of time 1574349894000 ms
  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -
  Time: 1574349894000 ms
  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -

And following appears on my Streaming UI application:enter image description here

I don't know what I am doing wrong. It is neither displaying anything nor adding any record to it. I earlier had specified the exact csv file, which did not work so I tried giving the path of the entire folder that has csv.Does anyone have any idea what am I missing? Thanks in advance.


Solution

  • TextFileStream does not use a Receiver thread and therefore does not log the records in the UI as other sources do:

    File Streams
    For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as via StreamingContext.fileStream[KeyClass, ValueClass, InputFormatClass].
    
    File streams do not require running a receiver so there is no need to allocate any cores for receiving file data.
    

    Source: https://spark.apache.org/docs/2.3.1/streaming-custom-receivers.html

    Somebody opened a PR on this JIRA ticket with changes in the Spark logic so this information but the ticket does not have a fix version set.

    What I usually do to know how many records entered each batch, is to log the count when processing the RDD in the forEachRDD:

    lines.forEachRDD( rdd -> {
    // You might want to cache the rdd before counting if you are dealing with large RDDs
    logger.debug(s"${rdd.count() records found")
    })
    
    

    Edit: Also regarding your file not being processed, you might want to set to DEBUG this package org.apache.spark.streaming.dstream.FileInputDStream in your logging configuration since it says which files it "sees" and why does it take it or not (mostly is because the timestamp being too old).