Is there a regex expression that returns an empty or null value if it doesn't find a match?
For example, I have a regex that processes this log line:
May 5 23:00:01 99.99.99.99 %FRA-8-333344: Built inbound UDP connection 9999888811 for outside:11.111.111.11/47747 (11.111.111.11/47747) to net-9999:22.22.22.22/53 (22.22.22.22/53)
But sometimes the logs differ; for example, one value may be missing (here, the connection ID):
May 5 23:00:01 99.99.99.99 %FRA-8-333344: Built inbound UDP for outside:11.111.111.11/47747 (11.111.111.11/47747) to net-9999:22.22.22.22/53 (22.22.22.22/53)
My problem is that I want to handle this variation. My idea is to return an empty value if the regex doesn't find a match. My next step is to build a Hive table, so the values extracted by the regex must be in the right order; for example, the UDP value must not be written to the connection ID column.
Does anyone know a solution to this problem? In R the solution is very simple (str_extract_all with an array of regex expressions), but I can't work it out in Scala.
key-values from first log:
timestamp: May 5 23:00:01
Action: Built
protocol: UDP
connection_id: 9999888811
src_ip: 11.111.111.11
dst_ip: 22.22.22.22
src_port: 47747
dst_port: 53
key-values from second log:
timestamp: May 5 23:00:01
Action: Built
protocol: UDP
connection_id: **EMPTY/NULL/" "**
src_ip: 11.111.111.11
dst_ip: 22.22.22.22
src_port: 47747
dst_port: 53
I would be grateful for any help :)
UPDATE 28.06.2017
My regex: https://regex101.com/r/4mtAtu/1
My solution. I think it will be slow:
case class logValues(time_stamp: String, action: String, protocol: String, connection_id: String, ips: String, ports: String)
def matchLog(x: String): logValues = {
  val time_stamp = """^.*?(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)""".r.findAllIn(x).mkString(",")
  val action = """((?<=:\s)\w{4,10}(?=\s\w{2})|(?<=\w\s)(\w{7,9})(?=\s[f]))""".r.findAllIn(x).mkString(",")
  val protocol = """(?<=[\w:]\s)(\w+)(?=\s[cr])""".r.findAllIn(x).mkString(",")
  val connection_id = """(?<=\w\s)(\d+)(?=\sfor)""".r.findAllIn(x).mkString(",")
  val ips = """(?<=[\d\w][:\s])(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?=\/\d+|\z| \w)""".r.findAllIn(x).mkString("|")
  val ports = """(?<=\d/)(\d{1,6})(?=\z|[\s(])""".r.findAllIn(x).mkString("|")
  logValues(time_stamp, action, protocol, connection_id, ips, ports)
}
You're compiling six different regex patterns and then submitting the input string to six different tests. A different approach is to create a single regex for the entire log line and extract the desired info via capture groups.
You'll have to tweak this since you know what parts are variant/invariant and I only have two example log lines to work with.
val logPattern =
  raw"^(.*)\s" +                                    // timestamp
  raw"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%\S+\s" +
  raw"(\w+)?\s\w+\s" +                              // action
  raw"(\w+)?\s\w*\s*" +                             // protocol
  raw"(\d+)?\s.*outside:" +                         // connection ID
  raw"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/" +      // src IP
  raw"(\d+).*:" +                                   // src port
  raw"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/" +      // dst IP
  raw"(\d+)"                                        // dst port
val logRE = logPattern.r.unanchored                 // only once
The upside: more efficient and everything is in one place. The downside: the whole pattern can fail if one section is incorrect. (Note: Compile the regex pattern only once. Not every time you pass in a new log line.)
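To illustrate the compile-once note, here is a minimal sketch. The object name ConnectionId and the method extract are hypothetical; the pattern is the connection_id regex from the question. The Regex lives in a val, so it is built a single time, and findFirstIn returns an Option, so a missing value comes back as None rather than an empty string.

```scala
import scala.util.matching.Regex

// Hypothetical helper: names are illustrative, not from the original post.
object ConnectionId {
  // Compiled once, when the object is first initialised.
  private val pattern: Regex = """(?<=\w\s)(\d+)(?=\sfor)""".r

  // Reuses the compiled pattern on every call.
  // Returns None when the line has no connection ID.
  def extract(line: String): Option[String] = pattern.findFirstIn(line)
}
```

Calling ConnectionId.extract on each log line then reuses the same compiled pattern, however many lines are processed.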
The extraction is now more direct.
log_line match {
  case logRE(ts,act,ptcl,cid,sip,sprt,dip,dprt) =>
    LogValues(ts,act,ptcl,cid,s"$sip/$dip",s"$sprt/$dprt")
  case _ => /* log line doesn't fit pattern */
}
You'll note I made three fields optional: action, protocol, and connection ID. Optional capture groups that don't capture anything return null, and while it's OK for String values to be null, it's not considered good practice. Much better to use Option[String] instead. And while we're at it, since the whole log line might fail the pattern recognition, let's make the return type optional as well.
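The null-to-Option conversion can be seen in isolation with a small sketch. The pattern and the names OptionalGroupDemo, idAndName, and idOf are made up for illustration and are unrelated to the log format. Option(x) yields None when x is null, which is what makes it a convenient wrapper for an optional group that did not participate in the match.

```scala
// Hypothetical demo: an optional numeric group followed by a mandatory name.
object OptionalGroupDemo {
  private val idAndName = raw"id=(\d+)?;name=(\w+)".r

  def idOf(s: String): Option[String] = s match {
    // id is null when group 1 matched nothing; Option(null) is None.
    case idAndName(id, _) => Option(id)
    // The whole pattern failed to match.
    case _                => None
  }
}
```

Note that idOf alone cannot distinguish "group missing" from "line did not match"; that is the reason for wrapping the whole result in an Option as well.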
case class LogValues( time_stamp    : String
                    , action        : Option[String]
                    , protocol      : Option[String]
                    , connection_id : Option[String]
                    , ips           : String
                    , ports         : String
                    )
log_line match {
  case logRE(ts,act,ptcl,cid,sip,sprt,dip,dprt) =>
    Some(LogValues( ts
                  , Option(act)
                  , Option(ptcl)
                  , Option(cid)
                  , s"$sip/$dip"
                  , s"$sprt/$dprt" ))
  case _ => /* log line doesn't fit pattern */
    None
}
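Putting the pieces together, a self-contained sketch might look like the following. The object name LogParser and the method parse are assumptions added for illustration; the pattern, the case class, and the two sample lines come from the posts above. As already noted, the pattern has only been checked against these two lines and will likely need tuning for other log variants.

```scala
object LogParser {
  case class LogValues(time_stamp: String,
                       action: Option[String],
                       protocol: Option[String],
                       connection_id: Option[String],
                       ips: String,
                       ports: String)

  // Single pattern for the whole line, compiled once.
  private val logRE = (
    raw"^(.*)\s" +                                    // timestamp
    raw"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%\S+\s" + // reporting host and message code
    raw"(\w+)?\s\w+\s" +                              // action
    raw"(\w+)?\s\w*\s*" +                             // protocol
    raw"(\d+)?\s.*outside:" +                         // connection ID (optional)
    raw"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/" +      // src IP
    raw"(\d+).*:" +                                   // src port
    raw"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/" +      // dst IP
    raw"(\d+)"                                        // dst port
  ).r.unanchored

  // None if the line does not fit the pattern at all;
  // optional groups that matched nothing become None via Option(null).
  def parse(line: String): Option[LogValues] = line match {
    case logRE(ts, act, ptcl, cid, sip, sprt, dip, dprt) =>
      Some(LogValues(ts, Option(act), Option(ptcl), Option(cid),
                     s"$sip/$dip", s"$sprt/$dprt"))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    val withId    = "May 5 23:00:01 99.99.99.99 %FRA-8-333344: Built inbound UDP connection 9999888811 for outside:11.111.111.11/47747 (11.111.111.11/47747) to net-9999:22.22.22.22/53 (22.22.22.22/53)"
    val withoutId = "May 5 23:00:01 99.99.99.99 %FRA-8-333344: Built inbound UDP for outside:11.111.111.11/47747 (11.111.111.11/47747) to net-9999:22.22.22.22/53 (22.22.22.22/53)"
    Seq(withId, withoutId).map(parse).foreach(println)
  }
}
```

For the Hive table, connection_id.getOrElse("") gives an empty string for the missing column while every other field keeps its position.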