
Join multiline message from log file into a single row in R


How is it possible to join multiple lines of a log file into 1 dataframe row?

Example log file with 4 records:

[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
[DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE  ((                            mr.cap_em >
 0 AND             mr.cap_em > 5
 ))  ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
workflow rule = BenCommonResources-getDataRecords
    version = 2.0
    filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
    sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
Result code = ruleFailed
Result message = Database error while processing request.
Result details = null
]]
[INFO ][2019-03-15 12:34:55,886][DefaultListableBeanFactory] - [Overriding bean definition for bean 'cpnreq': replacing [Generic bean: class [com.ar.moves.domain.bom.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/404.jar!/com/ar/moves/moves-context.xml]] with [Generic bean: class [com.ar.bl.bom.domain.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/Tools/Tomcatv8.5-appGit-master/404.jar!/com/ar/bl/bom/bl-bom-context.xml]]]

(See representative 8-line extract at https://pastebin.com/bsmWWCgw.)

The structure is clean:

[PRIOR][datetime][ClassName] - [Msg]

but the message often spans several lines, may itself contain brackets (even trailing ones), and may or may not contain ^M (carriage return) line endings. That makes it difficult to parse, and I don't know where to begin.

So, in order to process such a file, and be able to read it with something like:

#!/usr/bin/env Rscript

df <- read.table('D:/logfile.log')

we really need to have that merge of lines happening first. How is that doable?

The goal is to load the whole log file for making graphics, analysis (grepping out stuff), and eventually writing it back into a file, so -- if possible -- newlines should be kept in order to respect the original formatting.

The expected dataframe would look like:

PRIOR   Datetime              ClassName             Msg
-----   -------------------   -------------------   ----------
WARN    2016-12-16 13:43:10   ConfigManagerLoader   Low max...
DEBUG   2016-05-26 10:10:22   DataSourceImpl        SELECT ...

And, ideally once again, this should be doable in R directly (?), so that we can "process" a live log file (opened in write mode by the server app), "à la tail -f".


Solution

  • This is a pretty wicked regex problem. I'd recommend the stringr package (it also provides the %>% pipe used below), but you could do all of this with base R's grep-style functions.

    library(stringr)
    
    str <- c(
      '[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
      [DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE  ((                            mr.cap_em >
       0 AND             mr.cap_em > 5
       ))  ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
      [ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
      workflow rule = BenCommonResources-getDataRecords
          version = 2.0
          filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
          sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
      Result code = ruleFailed
      Result message = Database error while processing request.
      Result details = null
      ]]'
    )
    

    Using regex, we can split the text into records by matching the pattern you described. The regex looks for a [, followed by any characters (including line feeds and carriage returns), followed by a ], matched lazily (non-greedily) thanks to *?. That is repeated 3 times, followed by " - ". Finally, it looks for a [, followed by any characters or a group enclosed in square brackets, then a ]. Note that this last part is lazy as well, so it stops at the first ] it reaches: a message containing nested brackets, like the ERROR record, is cut off there. That's a mouthful, so paste it into a regex tester to follow along. Just remember to remove the extra backslashes (a regex tester takes \ where an R string needs \\).

    # Split the text into each line without using \n or \r.
    # pattern for each line is a lazy (non-greedy) [][][] - []
    linesplit <- str %>%
      # str_remove_all("\n") %>%
      # str_extract_all('\\[(.|\\n|\\r)+\\]')
      str_extract_all('\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\] - \\[(.|\\n|\\r|(\\[(.|\\n|\\r)*?\\]))*?\\]') %>%
      unlist()
    
    linesplit # Run this to view what happened
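
As an aside, the records can also be assembled line by line, which does not depend on where the brackets inside a message fall. The following is a minimal base-R sketch, not the method used in the rest of this answer. It assumes every record begins at the start of a line with a `[LEVEL][timestamp][ClassName] - [` header and that continuation lines never do. The helper name `parse_log_lines()` and the shortened sample lines are made up for illustration.

```r
# parse_log_lines() is a hypothetical helper, not part of the original answer.
parse_log_lines <- function(lines) {
  # Header that opens every record: [LEVEL][timestamp][ClassName] - [
  header <- "^\\[([A-Z]+) *\\]\\[([^]]+)\\]\\[([^]]+)\\] - \\["
  lines  <- sub("\r$", "", lines)              # drop a trailing ^M, if any

  # Each header line starts a new record; other lines continue the current one.
  record_id <- cumsum(grepl(header, lines))
  keep      <- record_id > 0                   # skip text before the first header
  if (!any(keep)) stop("no log record found")

  # Join the lines of each record, keeping the original line breaks.
  records <- vapply(split(lines[keep], record_id[keep]),
                    paste, character(1), collapse = "\n", USE.NAMES = FALSE)

  # Full header match plus its three capture groups, one row per record.
  parts <- do.call(rbind, regmatches(records, regexec(header, records)))

  # The message is whatever follows the header, minus the closing "]".
  msg <- substring(records, nchar(parts[, 1]) + 1)
  msg <- sub("\\][[:space:]]*$", "", msg)

  data.frame(PRIOR = parts[, 2], Datetime = parts[, 3], ClassName = parts[, 4],
             Msg = msg, stringsAsFactors = FALSE, row.names = NULL)
}

# Sample lines shortened from the log in the question.
lines <- c(
  "[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory]",
  "[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:",
  "    sortValues = [{\"fieldName\": \"cost_actual\", \"sortOrder\": -1}]",
  "]]"
)
df2 <- parse_log_lines(lines)
```

On the real file this would presumably be called as `parse_log_lines(readLines('D:/logfile.log'))`, using the path from the question. The trade-off is that a continuation line which itself begins with a complete header would be treated as a new record.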
    

    Now that the records are separated, break each one into columns. We don't want to keep the [ or ], so the regex uses a positive lookbehind and a positive lookahead to check that they are there without capturing them, and captures everything between them.

    # Split each line into columns
    colsplit <- linesplit %>% 
      str_extract_all("(?<=\\[)(.|\\n|\\r)*?(?=\\])")
    
    colsplit # Run this to view what happened
    

    Now we have a list with one element per record. Each element holds 4 items, one per column. We need to convert those 4 items into a one-row dataframe and then bind those dataframes together.

    # Convert each line to a dataframe, then join the dataframes together
    df <- lapply(colsplit,
      function(x){
        data.frame(
          PRIOR = x[1],
          Datetime = x[2],
          ClassName = x[3],
          Msg = x[4],
          stringsAsFactors = FALSE
        )
        }
      ) %>%
      do.call(rbind,.)
    
    df
    #   PRIOR                Datetime           ClassName             Msg
    # 1 WARN  2016-12-16 13:43:10,138 ConfigManagerLoader Low max memory=
    # 2 DEBUG 2016-05-26 10:10:22,185      DataSourceImpl SELECT mr.lb_id
    # 3 ERROR 2016-12-21 13:51:04,710  DWRWorkflowService Update Wizard -
    
    # Note: PRIOR still has trailing spaces that should be trimmed, and
    # Datetime is still a string with comma-separated milliseconds. I'll
    # leave those for the questioner to fix using a mutate and the string functions.
    

    I will leave it to you to trim the extra spaces and convert the date field.
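
A minimal base-R sketch of that cleanup, in case it helps. It works on a toy data frame shaped like the result above (values shortened for illustration), and the choice of `tz = "UTC"` is an assumption, since the log does not state its time zone.

```r
# Toy data frame shaped like the result above (values shortened for illustration).
df <- data.frame(
  PRIOR     = c("WARN ", "DEBUG"),
  Datetime  = c("2016-12-16 13:43:10,138", "2016-05-26 10:10:22,185"),
  ClassName = c("ConfigManagerLoader", "DataSourceImpl"),
  Msg       = c("Low max memory", "SELECT mr.lb_id"),
  stringsAsFactors = FALSE
)

# Remove the padding around the priority level ("WARN " becomes "WARN").
df$PRIOR <- trimws(df$PRIOR)

# "%OS" reads fractional seconds only with a "." separator, so swap the comma first.
# tz = "UTC" is an assumption; the log does not state its time zone.
df$Datetime <- as.POSIXct(
  sub(",", ".", df$Datetime, fixed = TRUE),
  format = "%Y-%m-%d %H:%M:%OS",
  tz = "UTC"
)
```

The same two steps could equally be written inside a dplyr `mutate()` with `str_trim()` and `str_replace()`; base R is used here only to keep the sketch free of extra packages.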