scala  apache-spark  spark-submit  spark-shell

Spark: read contents of a zip file in HDFS


I am trying to read data from a zip file.

I can read a whole text file as below:

val f = sc.wholeTextFiles("hdfs://")

but I don't know how to read the text data inside a zip file.

Is there any way to do this? If so, please let me know.


Solution

  • You can create an RDD from the zip file with SparkContext's newAPIHadoopFile method, passing ZipFileInputFormat as the input format class.

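    // ZipFileInputFormat comes from the third-party com.cotdp.hadoop package,
    // not from Spark or Hadoop, so its jar must be on the classpath.
    import com.cotdp.hadoop.ZipFileInputFormat
    import org.apache.hadoop.io.BytesWritable
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.Job
    
    // Keys are Text and values are BytesWritable holding the bytes of a zip entry.
    val zipFileRDD = sc.newAPIHadoopFile(
            "hdfs://tmp/sample_zip/LoanStats3a.csv.zip",
            classOf[ZipFileInputFormat],
            classOf[Text],
            classOf[BytesWritable],
            Job.getInstance().getConfiguration)  // new Job() is deprecated
    // copyBytes() returns only the valid bytes; getBytes() can include trailing padding.
    println("The file contents are: " + zipFileRDD.map(s => new String(s._2.copyBytes())).first())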
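
  • If adding the third-party input format is not an option, a possible alternative is to unzip the bytes yourself with java.util.zip. The sketch below is only an illustration: ZipText and readZipEntries are hypothetical names (not part of Spark or Hadoop), and it assumes the entries are UTF-8 text that fits in memory.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.ZipInputStream

object ZipText {
  // Hypothetical helper: returns (entryName, text) for every regular file
  // in a zip stream, decoding each entry as UTF-8 (an assumption).
  def readZipEntries(in: InputStream): List[(String, String)] = {
    val zis = new ZipInputStream(in)
    try {
      Iterator
        .continually(zis.getNextEntry)   // advance to the next entry
        .takeWhile(_ != null)            // null marks the end of the archive
        .filterNot(_.isDirectory)
        .map { entry =>
          // Read the current entry fully into memory.
          val out = new ByteArrayOutputStream()
          val buf = new Array[Byte](8192)
          var n = zis.read(buf)
          while (n != -1) {
            out.write(buf, 0, n)
            n = zis.read(buf)
          }
          (entry.getName, new String(out.toByteArray, StandardCharsets.UTF_8))
        }
        .toList                          // force evaluation before closing the stream
    } finally zis.close()
  }
}
```

    In spark-shell this could be combined with sc.binaryFiles, which yields (path, PortableDataStream) pairs, for example: sc.binaryFiles("hdfs://tmp/sample_zip/LoanStats3a.csv.zip").flatMap { case (_, stream) => ZipText.readZipEntries(stream.open()) }. Each archive is then read by a single task, so this is better suited to many small archives than to one very large zip.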