Tags: logstash, logstash-grok

How do I match the Spark log pattern in grok?


I'm working on parsing some log data and tried to implement a grok parser for Spark logs.

Here is one line of output from the Spark logs:

14/04/14 18:51:52 INFO Client: Command for the ApplicationMaster: $JAVA_HOME/bin/java -server -Xmx640m -Djava.io.tmpdir=$PWD/tmp org.apache.spark.deploy.yarn.ApplicationMaster --class SimpleApp --jar ./spark-example-1.0.0.jar --args 'yarn-standalone' --worker-memory 1024 --worker-cores 1 --num-workers 3 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr

And this is the grok filter I tried before:

(?<logtime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})%{SPACE}%{LOGLEVEL:level}%{SPACE}%{WORD:srcclass}:%{SPACE}%{GREEDYDATA:data}"

This does not work for me. Can someone help me?

Many thanks!


Solution

  • You're almost there. The only issue is the double quote at the end of your grok pattern; remove it and you'll be fine. Also, you don't need the %{SPACE} patterns unless you want to capture those spaces explicitly.

    This is what worked for me:

    (?<logtime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) %{LOGLEVEL:level} %{WORD:srcclass}: %{GREEDYDATA:data}
    

    It will produce this output:

    {
      "logtime": [
        [
          "14/04/14 18:51:52"
        ]
      ],
      "level": [
        [
          "INFO"
        ]
      ],
      "srcclass": [
        [
          "Client"
        ]
      ],
      "data": [
        [
          "Command for the ApplicationMaster: $JAVA_HOME/bin/java -server -Xmx640m -Djava.io.tmpdir=$PWD/tmp org.apache.spark.deploy.yarn.ApplicationMaster --class SimpleApp --jar ./spark-example-1.0.0.jar --args 'yarn-standalone' --worker-memory 1024 --worker-cores 1 --num-workers 3 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr"
        ]
      ]
    }
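To see why the pattern works, you can hand-expand it into an equivalent Python regex and test it outside Logstash. This is a minimal sketch: the `LOGLEVEL`, `\w+` (for `WORD`), and `.*` (for `GREEDYDATA`) expansions below approximate grok's built-in pattern definitions rather than reproducing them exactly.

```python
import re

# Simplified stand-in for grok's LOGLEVEL pattern (an assumption,
# not the full list shipped with Logstash).
LOGLEVEL = r"(?:TRACE|DEBUG|INFO|NOTICE|WARN|WARNING|ERROR|CRIT|FATAL)"

# Equivalent of:
# (?<logtime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) %{LOGLEVEL:level} %{WORD:srcclass}: %{GREEDYDATA:data}
pattern = re.compile(
    r"(?P<logtime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>" + LOGLEVEL + r") "
    r"(?P<srcclass>\w+): "   # WORD
    r"(?P<data>.*)"          # GREEDYDATA
)

line = ("14/04/14 18:51:52 INFO Client: Command for the ApplicationMaster: "
        "$JAVA_HOME/bin/java -server -Xmx640m ...")

m = pattern.match(line)
print(m.group("logtime"))   # 14/04/14 18:51:52
print(m.group("level"))     # INFO
print(m.group("srcclass"))  # Client
```

Note that there is no separator between the capture groups beyond a single literal space, which is why dropping the `%{SPACE}` tokens from the original attempt still matches.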