I have an archive directory with zip files that I would like to open 'through' Spark in streaming mode, writing the unzipped files in streaming to another directory that keeps the name of each zip file (one by one).
import zipfile
import io

def zip_extract(x):
    # x is a (path, bytes) pair; wrap the raw bytes so ZipFile can read them
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = file_obj.namelist()
    # map each entry name to its uncompressed contents
    return dict(zip(files, [file_obj.open(file).read() for file in files]))
Is there an easy way to read and write this in streaming? Thank you for your help.
As far as I know, Spark can't read archives out of the box.
A ZIP file both archives and compresses data. If you can, use a tool like gzip to compress the data but keep each file separate; that is, don't archive multiple files into a single one.
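The advantage of individually gzipped files is that Spark decompresses them transparently when reading text, so no custom handling is needed. A minimal sketch (the input path data/*.gz and app name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-example").getOrCreate()

# Spark decompresses .gz text files transparently on read;
# "data/*.gz" is a placeholder input path
lines = spark.sparkContext.textFile("data/*.gz")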
If the archive is a given and can't be changed, you can consider reading it with sparkContext.binaryFiles (https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html). This gives you each zipped file as a byte array in Spark, so you can write a mapper function that unzips it and returns the contents of its entries. You can then flatten that result to get an RDD of the files' contents, as in the sketch below.
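Putting that together, a minimal PySpark sketch might look like this. The input path data/*.zip and the app name are placeholders; keying each entry by the originating zip's name is one way to keep the zip file name around for a later write, as the question asks.

from pyspark.sql import SparkSession
import io
import os
import zipfile

spark = SparkSession.builder.appName("unzip-example").getOrCreate()
sc = spark.sparkContext

def zip_extract(record):
    # record is a (path, bytes) pair produced by binaryFiles
    path, content = record
    zip_name = os.path.basename(path)
    with zipfile.ZipFile(io.BytesIO(content), "r") as zf:
        for name in zf.namelist():
            # key each entry by its originating zip so the name is kept
            yield (zip_name, name, zf.read(name))

# flatMap flattens the per-archive generators into one RDD of
# (zip_name, entry_name, entry_bytes) triples
files_rdd = sc.binaryFiles("data/*.zip").flatMap(zip_extract)

Note that this is batch rather than structured streaming, but the same zip_extract mapper applies wherever you can get the archive as a (path, bytes) pair.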