Tags: hadoop, preprocessor, hdfs, flume, data-integration

Preprocessing and ingesting data in Hadoop


We have two types of logs:

1) SESSION LOG: SESSION_ID, USER_ID, START_DATE_TIME, END_DATE_TIME

2) EVENT LOG: SESSION_ID, DATE_TIME, X, Y, Z

We only need to store the event log, but we would like to replace the SESSION_ID with its corresponding USER_ID first. Which technologies (e.g. Flume?) should we use to store the data in HDFS?

Thanks!


Solution

  • Yes, Flume can be used to move log files into HDFS.

    To replace SESSION_ID with USER_ID, the simplest approach is a shell script that joins the two logs: it looks up each event's SESSION_ID in the session log and writes out a modified event log file with the USER_ID substituted in. Flume then picks up that modified file and delivers it to HDFS.
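As a minimal sketch of that shell-script step, assuming both logs are comma-separated and the file names (`session.log`, `event.log`, `event_modified.log`) are illustrative, an awk two-pass join does the substitution:

```shell
# Sample data in the layout from the question (file names are hypothetical).
# session.log: SESSION_ID,USER_ID,START_DATE_TIME,END_DATE_TIME
cat > session.log <<'EOF'
s1,u100,2015-01-01T10:00,2015-01-01T11:00
s2,u200,2015-01-01T10:05,2015-01-01T10:45
EOF

# event.log: SESSION_ID,DATE_TIME,X,Y,Z
cat > event.log <<'EOF'
s1,2015-01-01T10:01,1,2,3
s2,2015-01-01T10:06,4,5,6
EOF

# First pass (NR==FNR) reads session.log and builds a SESSION_ID -> USER_ID map;
# second pass replaces column 1 of each event line with the mapped USER_ID.
awk -F',' 'NR==FNR { user[$1] = $2; next } { $1 = user[$1]; print }' \
    OFS=',' session.log event.log > event_modified.log

cat event_modified.log
```

`event_modified.log` is the file a Flume source (for example, a spooling-directory source) would watch and ship to an HDFS sink. This keeps the preprocessing and the ingestion decoupled: the script only knows about files, and Flume only knows about the directory.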