What is the best way to ingest a log file into HDFS while it is still being written? I am configuring Apache Flume and want a source that also gives me data reliability. I first tried "exec" and later looked at "spooldir", but the following documentation at flume.apache.org has made me doubt both options:
Exec Source:
One of the most commonly requested features is a use case like "tail -F file_name", where an application writes to a log file on disk and Flume tails the file, sending each line as an event. While this is possible, there’s an obvious problem; what happens if the channel fills up and Flume can’t send an event? Flume has no way of indicating to the application writing the log file that it needs to retain the log or that the event hasn’t been sent for some reason. Your application can never guarantee data has been received when using a unidirectional asynchronous interface such as ExecSource!
Spooling Directory Source:
Unlike the Exec source, "spooldir" source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable files must be dropped into the spooling directory. If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
Is there anything better I can use to ensure that Flume does not miss any events and still reads in real time?
I would recommend using the Spooling Directory Source because of its reliability. A workaround for the immutability requirement is to write the files in a second directory and, once they reach a certain size (in bytes or number of log entries), move them to the spooling directory.
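To make the workaround concrete, here is a minimal sketch in Python of the "write elsewhere, then move" step. The function name roll_if_full, the file naming scheme, and the size threshold are my own illustrative choices, not anything Flume provides. It assumes the staging directory and the spooling directory are on the same filesystem (so the rename is atomic and Flume never sees a half-copied file) and that the writer has closed the file before the move. Each moved file gets a unique name, since the Flume documentation also warns against reusing file names in the spooling directory.

```python
import os
import time
import uuid


def roll_if_full(active_path, spool_dir, max_bytes):
    """Move active_path into spool_dir once it has reached max_bytes.

    Returns the destination path if the file was moved, otherwise None.
    Assumption: the caller has already closed the file, so nothing
    writes to it after it lands in the spooling directory.
    """
    if not os.path.exists(active_path):
        return None
    if os.path.getsize(active_path) < max_bytes:
        return None

    # Build a unique name (timestamp in ms plus a random suffix) so a
    # file name is never reused inside the spooling directory.
    base = os.path.basename(active_path)
    unique = "%s.%d.%s" % (base, int(time.time() * 1000), uuid.uuid4().hex[:8])
    dest = os.path.join(spool_dir, unique)

    # os.rename is atomic when source and destination are on the same
    # filesystem; the file appears in spool_dir complete or not at all.
    os.rename(active_path, dest)
    return dest
```

The application (or a small sidecar script) would call this after closing the current log file and then open a fresh file in the staging directory. Keep in mind the trade-off: events only reach Flume when a file is rolled, so a smaller threshold gives lower latency at the cost of more, smaller files.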