
Read tens of thousands of files and write to millions of files in Java


I am doing some unusual data manipulation. I have 36,000 input files, more than can be loaded into memory at once. I want to take the first byte of every file and put it in one output file, then do this again for the second byte, and so on. It does not need to be done in any specific order. Because the input files are compressed, loading them takes a bit longer and they can't be read 1 byte at a time; I end up with a byte array for each input file.

The input files are about 1-6 MB uncompressed and 0.3-1 MB compressed (lossy compression). Each output file ends up being the number of input files in bytes, ~36 KB in my example.

I know that ulimit can be raised on Linux and the equivalent can be done on Windows, but even so I don't think any OS will be happy with millions of files being written to concurrently.

My current solution is to make 3,000 or so BufferedWriter streams, load each input file in turn, write 1 byte to each of the 3,000 files, then close the input and load the next one. With this system each input file needs to be opened about 500 times.
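
Roughly, the idea looks like this (a simplified sketch, not my actual code; loadDecompressed() is a made-up placeholder for the decompression/loading step):

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.List;

    // Sketch of the current approach: keep a batch of output streams open,
    // stream every input past them, then move on to the next batch.
    public class BatchedTranspose {
        static final int BATCH = 3000; // output files kept open at once

        static void run(List<File> inputs, int totalOutputs) throws IOException {
            for (int start = 0; start < totalOutputs; start += BATCH) {
                int end = Math.min(start + BATCH, totalOutputs);
                BufferedOutputStream[] outs = new BufferedOutputStream[end - start];
                for (int i = 0; i < outs.length; i++) {
                    outs[i] = new BufferedOutputStream(new FileOutputStream("out_" + (start + i) + ".bin"));
                }
                try {
                    for (File f : inputs) {                  // every input is re-read for each batch
                        byte[] data = loadDecompressed(f);   // hypothetical decompressing loader
                        for (int i = start; i < end && i < data.length; i++) {
                            outs[i - start].write(data[i]);  // byte i goes to output file i
                        }
                    }
                } finally {
                    for (BufferedOutputStream o : outs) o.close();
                }
            }
        }

        static byte[] loadDecompressed(File f) throws IOException {
            throw new UnsupportedOperationException("placeholder for the real loader");
        }
    }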

The whole operation takes 8 days to complete, and this is only a test case for a more practical application that would have larger input files, more of them, and more output files.

Caching all the compressed files in memory and decompressing them as needed does not sound practical, and it would not scale to larger input files.

I think the solution would be to buffer what I can from the input files (because memory constraints will not allow buffering all of it), write to the output files sequentially, and then do it all over again.

However, I do not know if there is a better solution using something I am not read up on.

EDIT: I am grateful for the fast response. I know I was being vague about what the application is doing, and I will try to correct that. I basically have a three-dimensional array [images][X][Y]. I want to iterate over every image and save the color of a specific pixel from every image, and do this for every pixel position. The problem is memory constraints.

byte[] pixels = ((DataBufferByte) ImageIO.read( fileList.get(k) ).getRaster().getDataBuffer()).getData();

This is what I am using to load images, because it takes care of decompression and skipping the header.

I am not handling it as a video because I would have to get a frame, turn it into an image (a costly color space conversion), and then convert it to a byte[] just to get the pixel data in the RGB color space.

I could load each image, split it into ~500 parts (the size of Y), and write them to separate files that I leave open and append to for each image. The outputs would easily be under a gig. Each resulting file could then be loaded completely into memory and turned into an array for sequential file writing.
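
As a rough sketch of that intermediate step (assuming 8-bit grayscale data, a fixed width and height, and made-up intermediate file names):

    import javax.imageio.ImageIO;
    import java.awt.image.DataBufferByte;
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.List;

    // Append row y of every image to intermediate file y. Afterwards each
    // intermediate file holds one row across all images and is small enough
    // to load and transpose in memory.
    public class RowSplit {
        static void splitIntoRows(List<File> images, int width, int height) throws IOException {
            BufferedOutputStream[] rows = new BufferedOutputStream[height]; // ~500 files kept open
            for (int y = 0; y < height; y++) {
                rows[y] = new BufferedOutputStream(new FileOutputStream("row_" + y + ".bin"));
            }
            try {
                for (File f : images) {                                     // each image read once
                    byte[] pixels = ((DataBufferByte) ImageIO.read(f)
                            .getRaster().getDataBuffer()).getData();
                    for (int y = 0; y < height; y++) {
                        rows[y].write(pixels, y * width, width);            // append row y
                    }
                }
            } finally {
                for (BufferedOutputStream r : rows) r.close();
            }
        }
    }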

The intermediate step does mean I could split the load up over a network, but I am trying to get it done on a low-quality laptop with 4 GB of RAM, no GPU, and a low-end i7.

I had not considered saving anything to a file as an intermediate step before reading davidbak's response. Size is the only thing making this problem non-trivial, and I now see the size can be divided into smaller, more manageable chunks.


Solution

  • Three-phase operation (a rough code sketch of phases 1 and 3 follows below):

    Phase one: read all input files, one at a time, and write them into a single output file. The output file will be record oriented: say, 8-byte records, 4 bytes of "character offset" and 4 bytes of "character codepoint". As you're reading a file the character offset starts at 0, of course, so if the input file is "ABCD" you're writing (0, A) (1, B) (2, C) (3, D). Each input file is opened once, read sequentially, and closed. The output file is opened once, written sequentially throughout, then closed.

    Phase two: use an external sort to sort the 8-byte records of the intermediate file on the 4-byte character offset field.

    Phase three: open the sorted intermediate file and make one pass through it. Open a new output file every time the character offset field changes and write to that output file all the characters that belong to that offset. The input file is opened once and read sequentially. Each output file is opened, written to sequentially, then closed.

    Voilà! You need space for the intermediate file, and a good external sort (and space for its work files).

    As @Jorge suggests, both phase 1 and phase 2 can be parallelized, and in fact this sort of job as outlined (phases 1 to 3) is exactly in MapReduce/Hadoop's sweet spot.
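
A hedged sketch of phases 1 and 3 in Java (phase 2, the external sort on the offset field, is assumed to be done by an external tool or library and is not shown; loadDecompressed() and the output file names are placeholders, not part of the answer):

    import java.io.*;
    import java.util.List;

    public class ThreePhase {

        // Phase 1: one (offset, value) record per byte of every input, written sequentially.
        static void phase1(List<File> inputs, File intermediate) throws IOException {
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(intermediate)))) {
                for (File f : inputs) {
                    byte[] data = loadDecompressed(f);      // hypothetical decompressing loader
                    for (int offset = 0; offset < data.length; offset++) {
                        out.writeInt(offset);               // 4-byte "character offset"
                        out.writeInt(data[offset] & 0xFF);  // 4-byte "character codepoint"
                    }
                }
            }
        }

        // Phase 3: one pass over the externally sorted records; start a new output file
        // whenever the offset changes.
        static void phase3(File sorted) throws IOException {
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(sorted)))) {
                int currentOffset = -1;
                BufferedOutputStream out = null;
                try {
                    while (true) {
                        int offset, value;
                        try {
                            offset = in.readInt();
                            value = in.readInt();
                        } catch (EOFException eof) {
                            break;                          // end of sorted intermediate file
                        }
                        if (offset != currentOffset) {      // offset changed: next output file
                            if (out != null) out.close();
                            out = new BufferedOutputStream(
                                    new FileOutputStream("pixel_" + offset + ".bin")); // made-up name
                            currentOffset = offset;
                        }
                        out.write(value);
                    }
                } finally {
                    if (out != null) out.close();
                }
            }
        }

        static byte[] loadDecompressed(File f) throws IOException {
            throw new UnsupportedOperationException("placeholder for the asker's image loader");
        }
    }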