Search code examples
regexparsingdask

How to identify all space(s) as separators until a doublepoint is passed 2 times


I am trying to find a regular expression which should satisfy the following needs.

It should identify all space(s) as separators until a doublepoint is passed 2 times. After this pass, it should continue to use spaces as separators until a 3rd doublepoint is identified. This 3rd colon should be used as separator as well. But all spaces before and after this specific colon should not be used as separator. After this special doublepoint has been identified, no more separator should be found even its a space or a colon.

2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf c.w.f.w.NiceController             : z rest as async texting: json, special character, spacses.....

I would like to have the separators her identified as following (Separator shown as X)

2019-12-28X13:00:00.112XDEBUGXn-somethingspecial.atX---X[9999-118684]X3894ß8349ß84930ßaa14e38eae18e3ebfXc.w.f.w.NiceControllerXz rest as async texting: json, special character, spacses.....


2019-12-28 X 13:00:00.112 X DEBUG X n-somethingspecial.at X --- X [9999-118684] X 3894ß8349ß84930ßaa14e38eae18e3ebf X c.w.f.w.NiceController X z rest as async texting: json, special character, spacses.....

Exactly 8 separtors are found here.

Any ideas how to do this via regular expression?

My current approach does not work as I tried to to this like the following

(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?<=DEBUG)\s|(?<=\s---)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=\[[0-9a-z\#\.\-]{15}\])\s|((?<=\[[0-9a-z\#\.\-]{15}\]\s)\s|(?<=\[[0-9a-z\#\.\-]{15}\]\s[a-z0-9]{32})\s)|\s(?=---)|(?<=[a-zA-Z])\s+\:\s

That's my current syntax to identify the separators. Regex above is faulty.

(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})

This is the current regex. Targetapproach is to call

            df = pd.read_csv(file_name,
                         sep="(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})",
                         names=['date', 'time', 'level', 'host', 'template', 'threadid', 'logid', 'classmethods', 'line'],
                         engine='python',
                         nrows=100)

This could be extended later to dask which gives me the change to parse multiple log files in one dataframe.

The last column line is not identified correctly. For unknown reasons yet.


Solution

  • Solution

    The current outcome of my problem can be solved via the following regular expression.

     (?:(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.hostname\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s))|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})
    

    Maybe minor adaptions have to be done maybe but for now it works pretty good.