Null value if matching group does not match/work


Is there a regex expression that returns an empty or null value when it finds no match?

For example, I have regex which processes this log line:

May  5 23:00:01 99.99.99.99 %FRA-8-333344: Built inbound UDP connection 9999888811 for outside:11.111.111.11/47747 (11.111.111.11/47747) to net-9999:22.22.22.22/53 (22.22.22.22/53)

But sometimes the log lines differ. For example, one value may be missing (here the connection id):

May  5 23:00:01 99.99.99.99 %FRA-8-333344: Built inbound UDP for outside:11.111.111.11/47747 (11.111.111.11/47747) to net-9999:22.22.22.22/53 (22.22.22.22/53)

My problem is that I want to handle this variation. My idea is to return an empty value when the regex finds no match. My next step is to build a Hive table, so the values extracted by the regex must stay in the right order; for example, the UDP value must not end up in the connection id column.

Does anyone know a solution to this problem? In R the solution is very simple (str_extract_all with an array of regex expressions), but I can't work out how to do it in Scala.

Key-values from the first log line:

timestamp: May  5 23:00:01
action: Built
protocol: UDP
connection_id: 9999888811
src_ip: 11.111.111.11
dst_ip: 22.22.22.22
src_port: 47747
dst_port: 53

Key-values from the second log line:

timestamp: May  5 23:00:01
action: Built
protocol: UDP
connection_id: **EMPTY/NULL/" "**
src_ip: 11.111.111.11
dst_ip: 22.22.22.22
src_port: 47747
dst_port: 53

I would be grateful for any help :)

UPDATE 28.06.2017

My regex: https://regex101.com/r/4mtAtu/1

My solution, which I think will be slow:

case class logValues(time_stamp: String, action: String, protocol: String, connection_id: String, ips: String, ports: String)


def matchLog(x: String): logValues = {
  // Each field is extracted by its own regex; a field with no match becomes "".
  val time_stamp = """^.*?(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)""".r.findAllIn(x).mkString(",")
  val action = """((?<=:\s)\w{4,10}(?=\s\w{2})|(?<=\w\s)(\w{7,9})(?=\s[f]))""".r.findAllIn(x).mkString(",")
  val protocol = """(?<=[\w:]\s)(\w+)(?=\s[cr])""".r.findAllIn(x).mkString(",")
  val connection_id = """(?<=\w\s)(\d+)(?=\sfor)""".r.findAllIn(x).mkString(",")
  val ips = """(?<=[\d\w][:\s])(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?=\/\d+|\z| \w)""".r.findAllIn(x).mkString("|")
  val ports = """(?<=\d/)(\d{1,6})(?=\z|[\s(])""".r.findAllIn(x).mkString("|")

  // The last expression is the result; no explicit return is needed.
  logValues(time_stamp, action, protocol, connection_id, ips, ports)
}

Solution

  • You're compiling six different regex patterns and then running the input string through all six. A different approach is to create a single regex for the entire log line and extract the desired info via capture groups.

    You'll have to tweak this since you know what parts are variant/invariant and I only have two example log lines to work with.

    val logPattern =
      raw"^(.*)\s"                                    + // timestamp
      raw"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%\S+\s" +
      raw"(\w+)?\s\w+\s"                              + // action
      raw"(\w+)?\s\w*\s*"                             + // protocol
      raw"(\d+)?\s.*outside:"                         + // connection ID
      raw"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/"      + // src IP
      raw"(\d+).*:"                                   + // src port
      raw"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/"      + // dst IP
      raw"(\d+)"                                        // dst port
    
    val logRE = logPattern.r.unanchored  // only once
    

    The upside: more efficient and everything is in one place. The downside: the whole pattern can fail if one section is incorrect. (Note: compile the regex pattern only once, not every time you pass in a new log line.)

    The extraction is now more direct.

    log_line match {
      case logRE(ts,act,ptcl,cid,sip,sprt,dip,dprt) =>
        LogValues(ts,act,ptcl,cid,s"$sip/$dip",s"$sprt/$dprt")
      case _ => /* log line doesn't fit pattern */
    }
    

    You'll note I made three fields optional: action, protocol, and connection ID. An optional capture group that doesn't capture anything returns null, and while it's OK for String values to be null, it's not considered good practice. Much better to use Option[String] instead. And while we're at it, since the whole log line might fail the pattern recognition, let's make the return type optional as well.

    case class LogValues( time_stamp    : String
                        , action        : Option[String]
                        , protocol      : Option[String]
                        , connection_id : Option[String]
                        , ips           : String
                        , ports         : String
                        )
    
    log_line match {
      case logRE(ts,act,ptcl,cid,sip,sprt,dip,dprt) =>
        Some(LogValues( ts
                      , Option(act)
                      , Option(ptcl)
                      , Option(cid)
                      , s"$sip/$dip"
                      , s"$sprt/$dprt" ))
      case _ => /* log line doesn't fit pattern */
        None
    }
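
Putting the pieces above together, here is a minimal end-to-end sketch rather than a definitive implementation. The `LogParser` object and the `parse` and `toRow` names are hypothetical and not part of the original answer. The pattern is the one shown above, and it has only been checked against the two sample log lines from the question, so it may need tweaking for other message types. `toRow` is an assumption about how the result might be flattened for a Hive table: it keeps a fixed column order and writes an empty string for any missing optional field.

```scala
import scala.util.matching.Regex

// Same field layout as the Option-based case class above.
final case class LogValues(
  time_stamp: String,
  action: Option[String],
  protocol: Option[String],
  connection_id: Option[String],
  ips: String,
  ports: String
)

// Hypothetical wrapper object; the name is illustrative only.
object LogParser {
  // Built once when LogParser is initialised, then reused for every line.
  private val logRE: Regex = (
    raw"^(.*)\s" +                                      // timestamp
    raw"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%\S+\s" +   // host IP and message code (not captured)
    raw"(\w+)?\s\w+\s" +                                // action
    raw"(\w+)?\s\w*\s*" +                               // protocol
    raw"(\d+)?\s.*outside:" +                           // connection ID
    raw"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/" +        // src IP
    raw"(\d+).*:" +                                     // src port
    raw"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/" +        // dst IP
    raw"(\d+)"                                          // dst port
  ).r.unanchored

  // None when the line does not fit the pattern at all.
  def parse(line: String): Option[LogValues] = line match {
    case logRE(ts, act, ptcl, cid, sip, sprt, dip, dprt) =>
      // Option(null) is None, so unmatched optional groups become None.
      Some(LogValues(ts, Option(act), Option(ptcl), Option(cid),
                     s"$sip/$dip", s"$sprt/$dprt"))
    case _ => None
  }

  // Assumed Hive-friendly flattening: fixed column order, "" for missing values.
  def toRow(v: LogValues): Seq[String] = Seq(
    v.time_stamp,
    v.action.getOrElse(""),
    v.protocol.getOrElse(""),
    v.connection_id.getOrElse(""),
    v.ips,
    v.ports
  )
}
```

For the second sample line (the one without a connection id), `LogParser.parse` yields a `LogValues` whose `connection_id` is `None`, and `toRow` leaves the fourth column empty while every other value stays in its own column.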