I am working on a project that processes a very large amount of data. I have a lot (thousands) of zip files, each containing ONE simple txt file with thousands of lines (about 80k lines). What I am currently doing is the following:
    for (File zipFile : dir.listFiles()) {
        try (ZipFile zf = new ZipFile(zipFile)) {
            ZipEntry ze = zf.entries().nextElement();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(zf.getInputStream(ze)));
            ...
        }
    }
This way I can read the file line by line, but it is definitely too slow. Given the large number of files and lines that need to be read, I need a more efficient way to read them.
I have looked for a different approach, but I haven't been able to find anything. What I think I should use are the java.nio APIs designed precisely for intensive I/O operations, but I don't know how to use them with zip files.
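For reference, the closest thing I have found in NIO is the zip FileSystemProvider, which lets you open an archive as a file system and read its entries through java.nio.file. A minimal sketch of that route (assuming Java 7+; the method name readWithZipFs is just illustrative, and I doubt it changes the decompression cost):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;

    // Sketch only: opens each archive as an NIO file system and reads
    // its single txt entry line by line.
    static void readWithZipFs(Path dir) throws IOException {
        try (DirectoryStream<Path> zips = Files.newDirectoryStream(dir, "*.zip")) {
            for (Path zip : zips) {
                try (FileSystem zipFs = FileSystems.newFileSystem(zip, (ClassLoader) null)) {
                    // Each archive holds exactly ONE txt file, so take the first entry.
                    Path entry;
                    try (DirectoryStream<Path> entries =
                            Files.newDirectoryStream(zipFs.getPath("/"))) {
                        entry = entries.iterator().next();
                    }
                    try (BufferedReader in =
                            Files.newBufferedReader(entry, StandardCharsets.UTF_8)) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            // process the line here
                        }
                    }
                }
            }
        }
    }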
Any help would really be appreciated.
Thanks,
Marco
I have a lot (thousands) of zip files. The zipped files are about 30 MB each, while the txt inside each zip is about 60-70 MB. Reading and processing the files with this code takes many hours, around 15, but it depends.
Let's do some back-of-the-envelope calculations.
Let's say you have 5000 files. If it takes 15 hours (54,000 seconds) to process them, this equates to ~10 seconds per file. The files are about 30 MB each, so the throughput is ~3 MB/s.
This is between one and two orders of magnitude slower than the rate at which ZipFile can decompress stuff.
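You can sanity-check that claim without a profiler: decompress one archive and throw the bytes away, so you measure only ZipFile's decompression speed. A rough sketch (measureDecompression is just an illustrative name):

    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    // Drains one zip entry into a scratch buffer and reports the raw
    // decompression throughput, with no line parsing or processing involved.
    static void measureDecompression(File zip) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long bytes = 0;
        long start = System.nanoTime();
        try (ZipFile zf = new ZipFile(zip)) {
            ZipEntry ze = zf.entries().nextElement();
            try (InputStream in = zf.getInputStream(ze)) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    bytes += n;
                }
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%.1f MB in %.2f s (%.1f MB/s)%n",
                bytes / 1e6, seconds, bytes / 1e6 / seconds);
    }

Run it on one of your archives and compare the result against the ~3 MB/s figure above.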
Either there's a problem with the disks (are they local, or a network share?), or it is the actual processing that is taking most of the time.
The best way to find out for sure is by using a profiler.
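If you don't have a profiler handy, a crude split also works: accumulate time spent in the read path and in the processing path separately. A sketch, where processLine is a hypothetical stand-in for whatever your loop body does:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.zip.ZipFile;

    // Splits wall-clock time between decompress/read and per-line processing.
    static void timeSplit(File zipFile) throws IOException {
        long readNanos = 0, processNanos = 0;
        try (ZipFile zf = new ZipFile(zipFile);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     zf.getInputStream(zf.entries().nextElement())))) {
            long t = System.nanoTime();
            String line;
            while ((line = in.readLine()) != null) {
                long mid = System.nanoTime();
                readNanos += mid - t;
                processLine(line);   // hypothetical stand-in for your processing
                t = System.nanoTime();
                processNanos += t - mid;
            }
        }
        System.out.printf("read: %d ms, process: %d ms%n",
                readNanos / 1_000_000, processNanos / 1_000_000);
    }

Whichever bucket dominates tells you where to optimize; a real profiler will then show you the hot methods inside it.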