Tags: java, nio, java-io

Best way to write huge number of files


I am writing a lot of files, as shown below.

public void call(Iterator<Tuple2<Text, BytesWritable>> arg0)
        throws Exception {
    // TODO Auto-generated method stub

    while (arg0.hasNext()) {
        Tuple2<Text, BytesWritable> tuple2 = arg0.next();
        System.out.println(tuple2._1().toString());
        PrintWriter writer = new PrintWriter("/home/suv/junk/sparkOutPut/"+tuple2._1().toString(), "UTF-8");
        writer.println(new String(tuple2._2().getBytes()));
        writer.close();
    }
}

Is there a better way to write the files, without creating and closing a PrintWriter every time?


Solution

  • There is no significantly better way to write lots of files. What you are doing is inherently I/O intensive.

    UPDATE - @Michael Anderson is right, I think. Using multiple threads to write the files will (probably) speed things up considerably. However, I/O is still going to be the ultimate bottleneck, in several respects:

    • Creating, opening and closing files involves file & directory metadata access and update. This entails non-trivial CPU.

    • The file data and metadata changes need to be written to disk. That is possibly multiple disk writes.

    • There are at least 3 syscalls for each file written.

    • Then there are thread switching overheads.

    Unless the quantity of data written to each file is significant (multiple kilobytes per file), I doubt that techniques like NIO, direct buffers, JNI and so on will be worthwhile. The real bottlenecks will be in the kernel: file system operations and low-level disk I/O.
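
    The multi-threaded approach mentioned in the update could be sketched roughly as follows. This is an illustration under assumptions, not the asker's Spark code: the class name ParallelFileWriter, the method writeAll, and the Map of file names to byte arrays are all hypothetical stand-ins for the iterator of tuples in the question. Whether it helps at all depends on the file system and storage device, so it is worth measuring with different thread counts.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: writes many small files using a fixed-size thread pool.
public class ParallelFileWriter {

    // Writes each (file name -> content) entry into outputDir, one task per file.
    public static void writeAll(Path outputDir, Map<String, byte[]> files, int threads)
            throws IOException, InterruptedException {
        Files.createDirectories(outputDir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                Path target = outputDir.resolve(entry.getKey());
                byte[] content = entry.getValue();
                // Returning a value makes this lambda a Callable, which may throw IOException.
                pending.add(pool.submit(() -> {
                    Files.write(target, content);
                    return null;
                }));
            }
            // Wait for every task so that write failures are not silently lost.
            for (Future<?> task : pending) {
                try {
                    task.get();
                } catch (ExecutionException e) {
                    throw new IOException("Failed to write a file", e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

    Each task still opens, writes and closes its own file, so the per-file syscall and metadata costs listed above remain; the threads only let several of those operations be in flight at once.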


    ... without closing or creating printwriter every time.

    No. You need to create a new PrintWriter (or Writer or OutputStream) for each file.
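
    Although a writer per file cannot be avoided, try-with-resources at least makes the open/close pairing compact and guarantees the close even if a write fails. A minimal sketch, assuming the names and contents are available as a Map of Strings (the class PerFileWriter and method writeAll are hypothetical, not part of the question's code):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical helper: one writer per file, closed automatically.
public class PerFileWriter {

    public static void writeAll(Path outputDir, Map<String, String> files) throws IOException {
        Files.createDirectories(outputDir);
        for (Map.Entry<String, String> entry : files.entrySet()) {
            Path target = outputDir.resolve(entry.getKey());
            // A new writer is created for each file; try-with-resources closes it
            // when the block exits, whether normally or via an exception.
            try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                writer.write(entry.getValue());
                writer.newLine();
            }
        }
    }
}
```

    This does not reduce the number of file opens and closes; it only makes the unavoidable per-file lifecycle harder to get wrong.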

    However, this ...

      writer.println(new String(tuple2._2().getBytes()));
    

    ... looks rather peculiar. You appear to be:

    • calling getBytes() on a String (?),
    • converting the byte array to a String, and
    • calling the println() method on the String, which will copy it and then convert it back into bytes before finally outputting them.

    What gives? What is the point of the String -> bytes -> String conversion?

    I'd just do this:

      writer.println(tuple2._2());
    

    This should be faster, though I wouldn't expect the percentage speed-up to be that large.
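
    One caveat: the method signature in the question declares tuple2._2() as a BytesWritable rather than a String, in which case println(tuple2._2()) would print whatever that object's toString() returns rather than the raw data. If the payload really is bytes, a possible alternative is to skip the String conversion entirely and write the bytes through an OutputStream. The sketch below is an assumption-laden illustration: RawBytesWriter is a hypothetical name, and the plain byte[] plus validLength parameters stand in for what Hadoop's getBytes() and getLength() would supply (getBytes() returns the backing array, which may be longer than the valid data).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes raw bytes without any String round trip.
public class RawBytesWriter {

    // Writes only the first validLength bytes of buffer to target.
    public static void write(Path target, byte[] buffer, int validLength) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            out.write(buffer, 0, validLength);
        }
    }
}
```

    Unlike println(), this does not append a line separator, and it avoids decoding with the platform default charset, so the output is byte-for-byte what was in the buffer.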