Search code examples
linuxnetwork-programminghard-drivenfs

Process Many Files Concurrently — Copy Files Over or Read Through NFS?


I need to concurrently process a large amount of files (thousands of different files, with avg. size of 2MB per file).

All the information is stored on one (1.5TB) network hard drive, and will be processed by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).

Every machine -- following its reading of a file from the 'incoming' folder on the 1.5TB hard drive -- will be processing the information and be ready to output the processed information back to the 'processed' folder on the 1.5TB drive. the processed information for every file is of roughly the same average size as the input files (about ~2MB per file).

What is the better thing to do:

(1) For every processing machine M, Copy all files that will be processed by M into its local hard drive, and then read & process the files locally on machine M.

(2) Instead of copying the files to every machine, every machine will access the 'incoming' folder directly (using NFS), and will read the files from there, and then process them locally.

Which idea is better? Are there any 'do' and 'donts' when one is doing such a thing?

I am mostly curious if it is a problem to have 30 machines or so read (or write) information to the same network drive, at the same time?

(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...). Are there any bottlenecks that I should expect?

(I am use Linux, Ubuntu 10.04 LTS on all machines if it all matters)


Solution

  • I would definitely do #2 - and I would do it as follows:

    Run Apache on your main server with all the files. (Or some other HTTP server, if you really want). There are several reason's I'd do it this way:

    1. HTTP is basically pure TCP (with some headers on it). Once the request is sent - it's a very "one-way" protocol. Low overhead, not chatty. High performance and efficiency - low overhead.

    2. If you (for whatever reason) decided you needed to move or scale it out (using a could service, for example) HTTP would be a much better way to move the data around over the open Internet, than NFS. You could use SSL (if needed). You could get through firewalls (if needed). etc..etc..etc...

    3. Depending on the access pattern of your file, and assuming the whole file is required to be read - it's easier/faster just to do one network operation - and pull the whole file in in one whack - rather than to constantly request I/Os over the network every time you're reading a smaller piece of the file.

    4. It could be easy to distribute and run an application that does all this - and doesn't rely on the existance of network mounts - specific file paths, etc. If you have the URL to the files - the client can do it's job. It doesn't need to have established mounts, hard directory - or to become root to set-up such mounts.

    5. If you have NFS connectivity problems - the whole system can get whacky when you try to access the mounts and they hang. With HTTP running in a user-space context - you just get a timeout error - and your application can take whatever action it chooses (like page you - log errors, etc).