We have a system that receives archives in a specified directory and, on a regular basis, launches a MapReduce job that opens the archives and processes the files within them. To avoid re-processing the same archives on the next run, we hook into the close() method of our RecordReader and delete the archive after its last entry has been read.
The problem with this approach (we think) is that if a particular map task fails, the next attempt at it finds that the original file has already been deleted by the first attempt's record reader, and it bombs out. We think the way to go is to hold off until all the mapping and reducing is complete and then delete the input archives.
Is this the best way to do this?
If so, how can we obtain, from the main program, a listing of all the input files the job actually picked up? (We can't just scrub the whole input directory; new files may have arrived in the meantime.)
i.e.:
. . .
job.waitForCompletion(true);
// we're done; delete the input files, but how?
return 0;
}
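
Roughly what we're imagining is the sketch below. It assumes the job uses FileInputFormat and that the configured input paths are the directories the archives land in; snapshotting the file list before the job starts is our own idea for leaving newly arrived archives untouched, and the class and method names are just placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ArchiveJobRunner {

    public static int runAndCleanUp(Job job)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = job.getConfiguration();

        // Snapshot the concrete input files *before* the job starts, so archives
        // that land while the job is running are not deleted afterwards.
        List<Path> inputFiles = new ArrayList<>();
        for (Path dir : FileInputFormat.getInputPaths(job)) {
            FileSystem fs = dir.getFileSystem(conf);
            for (FileStatus status : fs.listStatus(dir)) {
                if (status.isFile()) {
                    inputFiles.add(status.getPath());
                }
            }
        }

        boolean ok = job.waitForCompletion(true);

        if (ok) {
            // Only delete what was present when the job was launched;
            // newly arrived archives survive for the next run.
            for (Path file : inputFiles) {
                file.getFileSystem(conf).delete(file, false);
            }
        }
        return ok ? 0 : 1;
    }
}

If we need a hard guarantee that the job processes exactly the snapshotted set, presumably we could add each snapshotted file as an explicit input path rather than pointing the job at the directory.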
Based on the situation you're describing, I can suggest the following:

1. The data monitoring, i.e. watching the directory into which the archives land, should be done by a separate process. That process can record status entries in a metadata table (for example in MySQL) as archives arrive, and those metadata entries can also be used to detect duplicates.

2. A separate process can then handle triggering the MapReduce job, checking the status recorded in the metadata table to decide when to run and which archives to feed in.
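
As a minimal sketch of what that bookkeeping could look like (the table name, columns, status values and JDBC details below are assumptions, not a fixed schema):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical bookkeeping for landed archives, backed by a MySQL table such as:
//   CREATE TABLE archive_status (
//     file_name VARCHAR(512) PRIMARY KEY,   -- the primary key rejects duplicate landings
//     status    VARCHAR(16) NOT NULL        -- LANDED -> PROCESSING -> DONE
//   );
public class ArchiveRegistry {
    private final Connection conn;

    public ArchiveRegistry(String jdbcUrl, String user, String password) throws SQLException {
        this.conn = DriverManager.getConnection(jdbcUrl, user, password);
    }

    // Called by the directory-monitoring process each time an archive lands.
    // Returns false if the archive was already registered (a duplicate).
    public boolean registerLanded(String fileName) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT IGNORE INTO archive_status (file_name, status) VALUES (?, 'LANDED')")) {
            ps.setString(1, fileName);
            return ps.executeUpdate() == 1;
        }
    }

    // Called by the job-triggering process: collect everything still marked LANDED
    // and flip it to PROCESSING so the next trigger does not pick it up again.
    public List<String> claimLandedArchives() throws SQLException {
        List<String> claimed = new ArrayList<>();
        conn.setAutoCommit(false);
        try (PreparedStatement select = conn.prepareStatement(
                 "SELECT file_name FROM archive_status WHERE status = 'LANDED' FOR UPDATE");
             PreparedStatement update = conn.prepareStatement(
                 "UPDATE archive_status SET status = 'PROCESSING' WHERE file_name = ?")) {
            try (ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    claimed.add(rs.getString(1));
                }
            }
            for (String fileName : claimed) {
                update.setString(1, fileName);
                update.executeUpdate();
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
        return claimed;
    }

    // Called after job.waitForCompletion(true) succeeds; only then is it safe
    // to mark the archives done and delete them from the landing directory.
    public void markDone(String fileName) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE archive_status SET status = 'DONE' WHERE file_name = ?")) {
            ps.setString(1, fileName);
            ps.executeUpdate();
        }
    }
}

The point of the status column is that the decision to delete an input archive is driven by the metadata, after the whole job has succeeded, rather than by individual record readers.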