
hadoop multiline mixed records


I would like to parse logfiles produced by the fidonet mailer binkd, which are multi-line and, much worse, mixed: several instances can write into one logfile, for example:

      27 Dec 16:52:40 [2484] BEGIN, binkd/1.0a-545/Linux -iq /tmp/binkd.conf
    + 27 Dec 16:52:40 [2484] session with 123.45.78.9 (123.45.78.9)
    - 27 Dec 16:52:41 [2484] SYS BBSName
    - 27 Dec 16:52:41 [2484] ZYZ First LastName
    - 27 Dec 16:52:41 [2484] LOC City, Country
    - 27 Dec 16:52:41 [2484] NDL 115200,TCP,BINKP
    - 27 Dec 16:52:41 [2484] TIME Thu, 27 Dec 2012 21:53:22 +0600
    - 27 Dec 16:52:41 [2484] VER binkd/0.9.6a-173/Win32 binkp/1.1
    + 27 Dec 16:52:43 [2484] addr: 2:1234/56.78@fidonet
    - 27 Dec 16:52:43 [2484] OPT NDA CRYPT
    + 27 Dec 16:52:43 [2484] Remote supports asymmetric ND mode
    + 27 Dec 16:52:43 [2484] Remote requests CRYPT mode
    - 27 Dec 16:52:43 [2484] TRF 0 0
    *+ 27 Dec 16:52:43 [1520] done (from 2:456/78@fidonet, OK, S/R: 0/0 (0/0 bytes))*
    + 27 Dec 16:52:43 [2484] Remote has 0b of mail and 0b of files for us
    + 27 Dec 16:52:43 [2484] pwd protected session (MD5)
    - 27 Dec 16:52:43 [2484] session in CRYPT mode
    + 27 Dec 16:52:43 [2484] done (from 2:1234/56.78@fidonet, OK, S/R: 0/0 (0/0 bytes))

So the logfile is not only multi-line with an unpredictable number of lines per session; records from different sessions are also interleaved, e.g. session 1520 finished in the middle of session 2484. What would be the right direction in Hadoop to parse such a file? Or should I just parse line by line, merge the lines into records later, and then write those records into a SQL database using another set of jobs?

Thanks.


Solution

  • The right direction for Hadoop is to develop your own input format, whose record reader reads the input line by line and produces logical records.
    You could also do this in the mapper, which might be a bit simpler; the drawback is that this is not the standard way to package such code for Hadoop, so it is less reusable.

    The other direction you mentioned is not "natural" for Hadoop, in my view. Specifically, why use all the complicated (and expensive) shuffle machinery to join together lines that are already in hand?
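The grouping logic such a record reader (or mapper) would implement can be sketched in plain Java. The idea rests on one assumption about the sample log: the bracketed PID (e.g. `[2484]`) identifies the session, and a line whose body starts with `done` closes it. Lines are buffered per PID, and a complete logical record is emitted when the closing line arrives. The class and method names here are illustrative, not part of any binkd or Hadoop API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BinkdSessionGrouper {
    // Matches e.g. "+ 27 Dec 16:52:40 [2484] session with ...":
    // capture the PID in brackets, then the rest of the line.
    private static final Pattern LINE =
        Pattern.compile(".*?\\[(\\d+)\\]\\s+(.*)");

    // Buffers of not-yet-completed sessions, keyed by PID.
    private final Map<String, List<String>> open = new HashMap<>();

    /**
     * Feed one raw log line. Returns the completed session's lines when
     * this line closes a session, or null while the session is still open.
     */
    public List<String> feed(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null; // not a recognized log line; skip it
        }
        String pid = m.group(1);
        String body = m.group(2);
        open.computeIfAbsent(pid, k -> new ArrayList<>()).add(line);
        if (body.startsWith("done")) {
            // Session finished: emit its buffered lines as one logical record.
            return open.remove(pid);
        }
        return null;
    }
}
```

Inside a custom `RecordReader`, `feed` would be called per input line until it returns a non-null record; in a mapper, the record could instead be emitted directly. Note that this only works if a whole session stays within one input split, which is another reason a custom input format (controlling split boundaries) is the cleaner packaging.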