apache-spark, hbase, delay

Apache Spark Delay Between Jobs


As you can see, my small application has 4 jobs that run for a total of 20.2 seconds; however, there is a large delay between jobs 1 and 2, which pushes the total time to over a minute. Job 1 (runJob at SparkHadoopMapReduceWriter.scala:88) performs a bulk upload of HFiles into an HBase table. Here is the code I used to load the files:

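// Imports assume HBase 1.x; in HBase 2.x LoadIncrementalHFiles lives in org.apache.hadoop.hbase.tool.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{KeyValue, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

// hBaseConf, resolvedTableName, HBaseUtils and preBulkUploadCallback are defined elsewhere in the application.
val outputDir = new Path(HBaseUtils.getHFilesStorageLocation(resolvedTableName))
val job = Job.getInstance(hBaseConf)
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, resolvedTableName)
job.setOutputFormatClass(classOf[HFileOutputFormat2])
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
val connection = ConnectionFactory.createConnection(job.getConfiguration)
val hBaseAdmin = connection.getAdmin
val table = TableName.valueOf(Bytes.toBytes(resolvedTableName))
val tab = connection.getTable(table).asInstanceOf[HTable]
val bulkLoader = new LoadIncrementalHFiles(job.getConfiguration)
// Run the optional pre-upload hook for its side effect.
preBulkUploadCallback.foreach(callback => callback())
// Hand the HFiles under outputDir over to the table's regions.
bulkLoader.doBulkLoad(outputDir, hBaseAdmin, tab, tab.getRegionLocator)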

If anyone has any ideas, I would be very grateful.

Spark History UI - Jobs Timeline


Solution

  • I can see that job 1 has 26 tasks, a number that is based on the number of HFiles created. Even though job 2 shows as completed in 2s, it takes some time to copy these files to the target location, and that is why you are seeing a delay between jobs 2 and 3. You can avoid this by reducing the number of tasks in job 1.
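
One possible way to reduce the task count is to sort the data into fewer partitions before the step that writes the HFiles, since Spark runs one write task per partition. The following is only a minimal sketch, not the code from the question: the object name, the method name, and the assumption that rows are held as (row key, value) String pairs with ASCII row keys are all hypothetical. Only sortByKey and its numPartitions parameter come from Spark itself.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of the original application.
object FewerWriteTasks {
  // Sorts rows by row key into at most `numTasks` range partitions.
  // Assumes (rowKey, value) String pairs with ASCII row keys, so that
  // String ordering matches HBase's byte ordering of the keys.
  def sortedIntoFewerPartitions(
      rows: RDD[(String, String)],
      numTasks: Int): RDD[(String, String)] =
    rows.sortByKey(ascending = true, numPartitions = numTasks)
}
```

The sorted RDD would then be converted to (ImmutableBytesWritable, KeyValue) pairs and written as before. With fewer partitions, that write stage should run fewer tasks and produce fewer HFiles to move during the bulk load. Each task then writes more data, so the right value for numTasks depends on the data volume.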