
Fluentd Parsing


Hi, I'm trying to parse a single-line log using Fluentd. Here is the log I'm trying to parse:

F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F8..........etc

I want to parse it into something like this:

{ "F2"   : "4200000000000000", "F3" : "000000", "F4"   : "000000060000" ............etc }

I tried using regex, but it's confusing and makes me write multiple regexes for different keys and values. Is there an easier way to achieve this?

EDIT1: I'll make this more detailed. I'm currently tailing logs with Fluentd and shipping them to Elasticsearch + Kibana. Here is an unparsed example log that Fluentd sends to Elasticsearch:

21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc
  • Elasticsearch received message:

{"message":"frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random digits and chars,F7:.......etc"}

This log has only a message key, so I can't index it or build a dashboard using just the whole message field. What I'm trying to achieve is to capture only the useful fields, add a key to any value that has none, and make indexing easier.

  • Expected output:
{"logdate" : "21/09/02 16:36:09.927238",
     "source" : "frSMS",
     "UID" : "#HTF4J",
     "statuscode" : "msg0210",
     "F2": "00000000000000000",
     "F3": "randomchar314516",.....}

I used the regexp parser plugin to parse it this way, but it got overwhelming. Here is what I have so far:

^(?<logDate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b)....(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}\b).(?<status>\d\b).(?<dummyfield>\d{5}\b).(?<HUID>.[A-Z]{5}\b)..(?<d1>\d{3}\b).(?<d2>\d{3}\b).(?<msgcode>msg\d{4}\b).(?<dummyfield1>\d{16}\b).(?<dummyfield2>\d{6}\b).(?<dummyfield3>\d{6,7}\b).(?<dummyfield4>\d{6}\b).(?<dummyfield5>\d{2}\b)...

Which results in:

"logDate": "21/09/02 16:36:09.205706", "source": "toSMS" , "status": "0", "dummyfield": "13995" , "UID" : "#HTFAA" , "d1" : "156" , "d2" : "156" , "msgcode" : "msg0210", "dummyfield1" :"0000000000000000" , "dummyfield2" :"002000", "dummyfield3" :"2000000", "dummyfield4" :"00", "dummyfield5" :"2000000" , "dummyfield6" :"867202"

This only matches the example log and contains useless fields like field1, dummyfield, dummyfield1, etc. Other logs contain the useful keys and values (date, source, msgcode, UID, F1, F2 fields) as shown in the expected output. The non-useful fields are not static (they can be missing, or have more or fewer digits and characters), so they trigger a "pattern not matched" error.
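
To illustrate the problem, here is a small Python sketch (the values are made up for illustration, and the group names from the regex above are rewritten with Python's (?P<...>) named-group syntax) showing how fixed digit counts stop matching as soon as one of the throwaway fields changes length:

import re

# Three of the dummy-field groups from the regex above, with fixed digit counts.
rigid = r'(?P<dummyfield2>\d{6}),(?P<dummyfield3>\d{6,7}),(?P<dummyfield4>\d{6})'

print(re.search(rigid, '002000,0072320,000000'))    # matches
print(re.search(rigid, '002000,007232000,000000'))  # None -> "pattern not matched"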

So the questions are:

  1. How do I capture the useful fields I mentioned using regex?
  2. How do I capture the F1, F2, F3, ... fields that have different value patterns, such as mixed characters and digits?

PS: I wrapped the regex I wrote in an HTML snippet so the <> capture groups don't get stripped.


Solution

  • Regex pattern to use:

    (F[\d]+):([\d]+)
    

    This pattern catches every 'F' key together with whatever digits follow it; even 'F105' still works. The whole key (e.g. 'F105') is stored as the first group of each regex match.

    The right part of the pattern catches the value: all the digits following ':' up to the first character that is not a digit (e.g. ',' or 'F'), and stores it as the second group of the match.

    Use

    Depending on your programming language, you will have to iterate over your regex matches and extract group 1 and group 2, respectively.

    Python example:

    import re

    log = 'F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F105:9726450'
    pattern = r'(F[\d]+):([\d]+)'  # raw string so the \d escapes reach the regex engine intact
    matches = re.finditer(pattern, log)

    # Group 1 is the key (e.g. 'F105'), group 2 is the digit-only value.
    log_dict = {}
    for match in matches:
        log_dict[match.group(1)] = match.group(2)
    print(log_dict)
    

    Output

    {'F2': '4200000000000000', 'F3': '000000', 'F4': '000000060000', 'F6': '000000000000', 'F7': '000000000', 'F105': '9726450'}
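
  • Extending to the full log line:

    The pattern above only captures digit-only values, so a field like F6:Random message and strings from the example line would be skipped, and the logdate/source/UID/statuscode fields would not be extracted at all. Below is a minimal Python sketch (not a Fluentd configuration) that combines a header regex with a more permissive F-field pattern accepting any value up to the next comma. The header regex is an assumption based on the single example line in the question, so its character classes and separators may need adjusting for your real logs.

    import re

    # Example line from the question, shortened to end at F6.
    log = ('21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,'
           '00000000000000000,000000,000000,007232,00,#,'
           'F2:00000000000000000,F3:002000,F4:000000820000,'
           'F6:Random message and strings')

    log_dict = {}

    # Header fields, assuming the layout of the example line:
    # "<date> <time>: <n> <source>:...:#<UID>::...:msg<digits>,..."
    header = re.search(
        r'^(?P<logdate>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}).*?'
        r'(?P<source>(?:fr|to)[A-Z]{3,4}):.*?'
        r'(?P<UID>#[A-Z0-9]{5}).*?'
        r'(?P<statuscode>msg\d{4})',
        log)
    if header:
        log_dict.update(header.groupdict())

    # F fields: accept any value up to the next comma, so mixed text values
    # such as 'F6:Random message and strings' are captured as well.
    for key, value in re.findall(r'(F\d+):([^,]+)', log):
        log_dict[key] = value

    print(log_dict)
    # -> logdate, source, UID and statuscode plus every F field, e.g.
    #    'F6': 'Random message and strings'

    If you need this inside Fluentd itself rather than in a script, the same named capture groups can be used with the regexp parser, but the exact configuration depends on your pipeline.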