
Parsing pseudo-HTML/XML logfiles into a data frame (Symantec Altiris) [R]


I've been asked to help parse some log files for a Symantec application (Altiris), and they were delivered to me in a pseudo-HTML/XML format. I've managed to use readLines() and grepl() to read the logs into a clean character vector and strip out the junk, but I can't get the result into a data frame.

Right now, an entry looks something like this (made up, since I can't post real data). All entries sit in a character vector with structure chr[1:312]:

[310] "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >"

I've had no luck with XML parsing. The format looks more like HTML to me, but when I tried htmlTreeParse(x) I just ended up with a massive pyramid of tags.


Solution

  • If you're working with pseudo-XML, it's probably best to define the parsing rules yourself. I like stringr and dplyr for stuff like this.

    Here's a two-element vector (instead of 312 in your case):

    vec <- c(
      "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >",
      "<severity='5', hostname='computername126', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='235' >"
    )
    

    Convert it to a data.frame object:

    df <- data.frame(vec, stringsAsFactors = FALSE)
    

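    Then extract each value by character position, relative to where each attribute name starts. The offsets come from the fixed text around each value: severity=' is 10 characters long, so the value starts at severityStr + 10, and its last character sits 4 positions before the next attribute name, because the closing quote, comma, and space come in between: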

    library(stringr)
    library(dplyr)
    
    df %>%
      mutate(
        # Start position of each attribute name, and of the closing ">"
        severityStr = str_locate(vec, "severity")[, "start"],
        hostnameStr = str_locate(vec, "hostname")[, "start"],
        sourceStr = str_locate(vec, "source")[, "start"],
        moduleStr = str_locate(vec, "module")[, "start"],
        processStr = str_locate(vec, "process")[, "start"],
        pidStr = str_locate(vec, "pid")[, "start"],
        endStr = str_locate(vec, ">")[, "start"],
        # Each value starts right after name=' and ends 4 positions before the next name
        severity = substr(vec, severityStr + 10, hostnameStr - 4),
        hostname = substr(vec, hostnameStr + 10, sourceStr - 4),
        source = substr(vec, sourceStr + 8, moduleStr - 4),
        module = substr(vec, moduleStr + 8, processStr - 4),
        process = substr(vec, processStr + 9, pidStr - 4),
        # The last value ends 3 positions before ">" (closing quote and a space)
        pid = substr(vec, pidStr + 5, endStr - 3)) %>%
      # Keep only the parsed fields
      select(severity, hostname, source, module, process, pid)
    

    Here's the resulting data frame:

      severity        hostname          source       module     process pid
    1        4 computername125 PackageDownload herpderp.dll masterP.exe 234
    2        5 computername126 PackageDownload herpderp.dll masterP.exe 235
    

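    This solution handles values of different lengths, because every position is computed relative to the attribute names rather than hard-coded. For example, it would read pid correctly even if it were 95 (two digits instead of three). It does assume that the attributes always appear in this order with the same quote, comma, and space between them, and that an attribute name such as pid does not also occur inside an earlier value.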
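
If the attribute order or spacing might vary, a regex-based variant may be easier to maintain. This is a different technique from the position-based one above, shown only as a minimal sketch: it assumes every value is wrapped in single quotes and contains no single quote itself. The helper name parse_entry and the variable names fields and parsed are made up for illustration; str_match_all() is from stringr.

```r
library(stringr)

# Same sample data as above, but with a two-digit pid in the second entry
vec <- c(
  "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >",
  "<severity='5', hostname='computername126', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='95' >"
)

# Hypothetical helper: turn one log line into a named character vector.
# The pattern captures every name='value' pair: column 2 of the match
# matrix is the name, column 3 is the value.
parse_entry <- function(line) {
  m <- str_match_all(line, "(\\w+)='([^']*)'")[[1]]
  setNames(m[, 3], m[, 2])
}

# Columns to keep, in the order they should appear in the data frame
fields <- c("severity", "hostname", "source", "module", "process", "pid")

# Look up each field by name, so the order within a line does not matter;
# a field missing from a line becomes NA.
rows <- lapply(vec, function(line) unname(parse_entry(line)[fields]))
mat <- do.call(rbind, rows)
colnames(mat) <- fields

parsed <- as.data.frame(mat, stringsAsFactors = FALSE)
print(parsed)
```

All columns come back as character, so something like as.integer(parsed$pid) would still be needed for numeric fields. Because values are looked up by name, no offsets have to be recounted if an attribute is added or the separators change.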