Regex for whitespace delimiter except for [ and ] characters


I consider myself pretty good with regular expressions, but this one is proving surprisingly tricky.

I want to treat all whitespace as a delimiter, except the whitespace inside "..." and [...].

I used the regex ("[^"]*"|\S+)\s+ but it split the [06/Jan/2021:17:50:09 +0300] part of my log into two blocks.

Here is my entire log line:

[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""

The result I am getting from my regex with the sed command (replacing whitespace with a comma):

[06/Jan/2021:17:50:09,+0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""

And the result that I want:

[06/Jan/2021:17:50:09 +0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""

Solution

  • Since the sample input looks like log lines, and assuming they are always in the same format, you could try the following awk code, written and tested on the shown samples in GNU awk.

    awk -v FPAT='[^]]*\\]|"[^"]*"|([0-9]+\\.){3}[0-9]+|[0-9]{2,4}' -v OFS="," '{$1=$1} 1'  Input_file
    

    Explanation:

    • This uses GNU awk, which provides the FPAT variable.
    • FPAT defines fields by a regex that describes their content, rather than the separators between them. Each match of the regex on a line becomes one field.
    • OFS (the output field separator) is set to , for all lines.
    • In the main program, assigning $1 to itself makes awk rebuild the line with OFS between the fields, so commas appear only where the OP needs them. The trailing 1 prints the rebuilt line.

    Explanation of regex:

    [^]]*\\]               ##Matches everything up to the first ], then the ] itself.
    |                      ##OR
    "[^"]*"                ##Matches an opening ", everything up to the next ", and that closing ".
    |                      ##OR
    ([0-9]+\\.){3}[0-9]+   ##Matches three groups of digits each followed by a dot, then a final group of digits (an IPv4-style address).
    |                      ##OR
    [0-9]{2,4}             ##Matches 2 to 4 digits.
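
The FPAT above is tied to the shape of the sample: a bare number with one digit, or with more than four, would not come through as a single field. Since the OP asked about sed, a possible alternative is to keep the OP's original regex and add one more alternative for [...] groups in front of it. The snippet below is a sketch under that assumption, not the answer's method; `to_csv` is a hypothetical helper name, and it assumes a sed that supports -E (GNU sed does).

```shell
# Sample log line from the question.
line='[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""'

# to_csv (hypothetical name): each match is one token followed by whitespace.
# A token is a [...] group, a "..." string, or a run of non-whitespace.
# The token is kept (\1) and the whitespace after it becomes a comma.
to_csv() {
  sed -E 's/(\[[^]]*\]|"[^"]*"|[^[:space:]]+)[[:space:]]+/\1,/g'
}

printf '%s\n' "$line" | to_csv
# prints: [06/Jan/2021:17:50:09 +0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""
```

Because only whitespace that follows a token is replaced, a line with trailing whitespace would end in a trailing comma; strip it first if that matters.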