
java.lang.ArrayIndexOutOfBoundsException: 0 if directory does not have files


Please assist me with the following scenario. I'm scanning the folders for the last two hours, taking the most recent CSV file from each, and building a single list. If both hourly folders contain files, the code below works as expected, but if either folder contains no files, it throws "ArrayIndexOutOfBoundsException: 0".

Code:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import scala.language.postfixOps

// Running in spark-shell, where `spark` and its implicits are already in scope.
val hdfsConf = new Configuration()
val path = "/user/hdfs/test/input"
var finalFiles = List[String]()
val currentTs = java.time.LocalDateTime.now
val hours = 2

// Build one folder path per hour to scan.
val paths = (0 until hours).map(h => currentTs.minusHours(h))
  .map(ts => s"${path}/partition_date=${ts.toLocalDate}/hour=${ts.toString.substring(11, 13)}")
  .toList

// paths: List[String] = List(/user/hdfs/test/input/partition_date=2022-11-30/hour=19,
//                            /user/hdfs/test/input/partition_date=2022-11-30/hour=18)

for (eachfolder <- paths) {
  val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val pathstatus = fs.listStatus(new Path(eachfolder))
  val currpathfiles = pathstatus.map(x => Row(x.getPath.toString, x.getModificationTime))

  // Pick the most recently modified CSV file in this folder.
  val latestFile = spark.sparkContext.parallelize(currpathfiles)
    .map(row => (row.getString(0), row.getLong(1)))
    .toDF("FilePath", "ModificationTime")
    .filter(col("FilePath").like("%.csv%"))
    .sort($"ModificationTime".desc)
    .select(col("FilePath")).limit(1)
    .map(row => row.getString(0))
    .collectAsList.get(0) // throws ArrayIndexOutOfBoundsException: 0 when the folder has no CSV files

  finalFiles = latestFile :: finalFiles
}

Error:

java.lang.ArrayIndexOutOfBoundsException: 0 

Solution

  • You're running into this when trying to obtain the 0th element of an empty list. You can avoid it by taking the head of the collected result as an Option (headOption) and using foreach on that Option. Note that collectAsList returns a java.util.List, which has no headOption; collect returns an Array, which does.

    spark.sparkContext.parallelize(currpathfiles)
      .map(row => (row.getString(0), row.getLong(1)))
      ...
      .map(row => row.getString(0))
      .collect
      .headOption
      .foreach(latestFile => finalFiles = latestFile :: finalFiles)
    

    Also note that instead of assigning latestFile to a var, this implementation prepends it to finalFiles inside the Option's foreach (the foreach only runs when collect actually returned an element).
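
    For reference, here is the complete loop with that fix applied, as a minimal sketch. It assumes the same paths list from the question and a spark-shell style environment (a SparkSession named spark with its implicits in scope), and that each hourly directory itself exists; listStatus throws FileNotFoundException for a missing directory, which is a separate failure mode from an empty one.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.col

    var finalFiles = List[String]()
    // One FileSystem handle is enough; it does not change per folder.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    for (eachfolder <- paths) {
      val pathstatus = fs.listStatus(new Path(eachfolder))
      val currpathfiles = pathstatus.map(x => Row(x.getPath.toString, x.getModificationTime))

      spark.sparkContext.parallelize(currpathfiles)
        .map(row => (row.getString(0), row.getLong(1)))
        .toDF("FilePath", "ModificationTime")
        .filter(col("FilePath").like("%.csv%"))
        .sort($"ModificationTime".desc)
        .select(col("FilePath")).limit(1)
        .map(row => row.getString(0))
        .collect          // Array[String]; empty when the folder has no CSV files
        .headOption       // Some(file) or None, never an exception
        .foreach(latestFile => finalFiles = latestFile :: finalFiles)
    }

    With this version, a folder with no CSV files simply contributes nothing to finalFiles instead of aborting the loop.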