I need to watch a specific folder for new files, and whenever a new file arrives, I need to process it and send the processed data to an indexing software.
All I need to do is watch the folder and, whenever a new file comes in, read its contents. The Flume spooling directory looks like a good fit, but here are the challenges I am thinking about.
1) Each file must be read only once; a file that has already been read should not be read again.
2) Completeness of a file: if a file has not been copied fully (for example, .staging or .tmp files are still there), I should not read it.
3) The input files are XML and can be huge, so reading a file in splits does not help my cause. I need to read each file in full and then process it.
4) The files might be huge, and Flume seems to have some problems with huge files.
Can Flume fit my requirements, or should I look at other file watchers?
Could you please suggest the best option for watching the files? Does the Flume spooling directory do all of this?
I can't say anything about Flume; I am unfamiliar with it.
You can do one of a couple of things.
First, you could copy the files into the directory using one type of name (like newfile.copying), and then rename them to just "newfile" after the copy is complete. Then during your scans, you simply ignore the "*.copying" files.
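Here is a rough sketch of that copy-then-rename idea in Python. It assumes you control the program that drops the files off; the function names `deliver` and `ready_files` are made up for illustration, and the `.copying` suffix is just the one from the example above.

```python
import os
import shutil


def deliver(src, dest_dir, suffix=".copying"):
    """Copy src into dest_dir under a temporary name, then rename it."""
    final_path = os.path.join(dest_dir, os.path.basename(src))
    temp_path = final_path + suffix
    shutil.copyfile(src, temp_path)    # slow part; the file is incomplete here
    os.replace(temp_path, final_path)  # atomic when both names are on one file system
    return final_path


def ready_files(watch_dir, suffix=".copying"):
    """Return the names in watch_dir that are not still being copied."""
    return sorted(
        name
        for name in os.listdir(watch_dir)
        if not name.endswith(suffix)
        and os.path.isfile(os.path.join(watch_dir, name))
    )
```

The scanner only ever sees a file under its final name once the copy has finished, so it never needs to guess whether a file is complete.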
Second, you could monitor the sizes of the files as they load, and if a file's size has not changed after some time (a few seconds), then you can assume the file is done copying and start processing.
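A minimal sketch of the size check in Python might look like this. The name `is_stable`, the two-second wait, and the number of checks are all assumptions to tune for your setup, and keep in mind it is only a heuristic: a copy that has stalled looks the same as a copy that has finished.

```python
import os
import time


def is_stable(path, wait_seconds=2.0, checks=2, sleep=time.sleep):
    """Return True if the file size stays unchanged across several checks."""
    last_size = os.path.getsize(path)
    for _ in range(checks):
        sleep(wait_seconds)  # give the writer time to append more data
        size = os.path.getsize(path)
        if size != last_size:
            return False     # still growing, try again on a later scan
        last_size = size
    return True
```

You would call `is_stable(path)` on each candidate during a scan and skip any file for which it returns False.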
Finally, you should simply have a "done" directory (on the same drive), and rename the files to that directory when you're done with them.
Another option is that you could have three directories: "incoming", "working", "done".
The files are copied into the "incoming" directory. Before you start processing them, you rename them to the "working" directory. Finally, you move them out of there into the "done" directory.
This gives you the ability to recover in case the system gets interrupted. You will "know" what the last file you were processing is, and you can either reprocess it, or whatever you like.
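The three-directory flow could be sketched in Python roughly as below. The names `claim` and `run_once` and the `process` callback are hypothetical, and putting leftover "working" files back into "incoming" for reprocessing is just one possible recovery policy. All three directories are assumed to be on the same file system.

```python
import os


def claim(name, incoming, working):
    """Move a file from incoming to working; return its new path or None."""
    src = os.path.join(incoming, name)
    dst = os.path.join(working, name)
    try:
        os.replace(src, dst)  # atomic rename on the same file system
    except FileNotFoundError:
        return None           # someone else already took it
    return dst


def run_once(incoming, working, done, process):
    """Recover interrupted files, then process everything in incoming."""
    # Anything still in working was interrupted; queue it again (assumed policy).
    for name in sorted(os.listdir(working)):
        os.replace(os.path.join(working, name), os.path.join(incoming, name))
    handled = []
    for name in sorted(os.listdir(incoming)):
        path = claim(name, incoming, working)
        if path is None:
            continue
        process(path)  # if this raises, the file stays in working
        os.replace(path, os.path.join(done, name))
        handled.append(name)
    return handled
```

If `process` fails or the machine goes down halfway, the file is still sitting in "working", so the next run can see exactly which file was in flight.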
The rename options are important because, on the same file system, they are atomic. You'll never have a file in one directory and not the other, or of one name and the other at the same time.