Tags: scala, apache-spark, rdd

Loading files based on pattern matching in Spark


I have 31 input files, named from date=2018-01-01 through date=2018-01-31.

I am able to load all of these files into an RDD like this:

val input = sc.textFile("hdfs://user/cloudera/date=*")

But what if I want to load only one week's worth of files (date=2018-01-15 through date=2018-01-22)?


Solution

  • You can pass the files to textFile individually by joining their paths with commas (,):

    val files = (15 to 22).map(
      day => "hdfs://user/cloudera/date=2018-01-" + "%02d".format(day)
    ).mkString(",")
    

    which produces:

    hdfs://user/cloudera/date=2018-01-15,hdfs://user/cloudera/date=2018-01-16,hdfs://user/cloudera/date=2018-01-17,hdfs://user/cloudera/date=2018-01-18,hdfs://user/cloudera/date=2018-01-19,hdfs://user/cloudera/date=2018-01-20,hdfs://user/cloudera/date=2018-01-21,hdfs://user/cloudera/date=2018-01-22
    

    and you can call it this way:

    val input = sc.textFile(files)
    

    Notice the "%02d".format(day) formatting of the day, which adds a leading 0 for days 1 through 9.
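
  • Alternatively, paths passed to textFile are resolved with Hadoop's glob matching, which supports brace alternation ({a,b,...}). Here is a sketch of the same selection as a single pattern string, assuming the same HDFS layout as above:

    // Build a Hadoop glob with brace alternation: {15,16,...,22}.
    // Glob syntax has no numeric ranges, so the day list is generated in Scala.
    val pattern = "hdfs://user/cloudera/date=2018-01-{" +
      (15 to 22).map("%02d".format(_)).mkString(",") + "}"

    val input = sc.textFile(pattern)

    Hadoop expands the braces when listing the input directories, so this should load the same eight days as the comma-joined list; which form to use is mostly a readability choice.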