I've been asked to help parse some log files for a Symantec application (Altiris), and they were delivered to me in a pseudo-HTML/XML format. I've managed to use readLines() and grepl() to get the logs into a decent character vector format and clean out the junk, but I can't get it into a data frame.
As of right now, an entry looks something like this (I can't post real data); all entries sit in a character vector with structure chr[1:312]:
[310] "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >"
I've had no luck with XML parsing. The data looks more like HTML to me, but when I tried htmlTreeParse(x) I just ended up with a massive pyramid of tags.
If you're working with pseudo-XML, it's probably best to define the parsing rules yourself. I like stringr and dplyr for stuff like this.
Here's a two-element vector (instead of 312 in your case):
vec <- c(
"<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >",
"<severity='5', hostname='computername126', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='235' >"
)
Convert it to a data.frame object:
df <- data.frame(vec, stringsAsFactors = FALSE)
Then extract each value by character position, using the start positions of the field names as anchors:
library(stringr)
library(dplyr)

df %>%
  mutate(
    # Start position of each field name (and of the closing ">") in the string
    severityStr = str_locate(vec, "severity")[, "start"],
    hostnameStr = str_locate(vec, "hostname")[, "start"],
    sourceStr   = str_locate(vec, "source")[, "start"],
    moduleStr   = str_locate(vec, "module")[, "start"],
    processStr  = str_locate(vec, "process")[, "start"],
    pidStr      = str_locate(vec, "pid")[, "start"],
    endStr      = str_locate(vec, ">")[, "start"],
    # Each value starts after name=' (name length + 2 characters)
    # and ends just before the "', " that precedes the next field name
    severity = substr(vec, severityStr + 10, hostnameStr - 4),
    hostname = substr(vec, hostnameStr + 10, sourceStr - 4),
    source   = substr(vec, sourceStr + 8, moduleStr - 4),
    module   = substr(vec, moduleStr + 8, processStr - 4),
    process  = substr(vec, processStr + 9, pidStr - 4),
    # The last value ends before the "' " that precedes ">"
    pid      = substr(vec, pidStr + 5, endStr - 3)
  ) %>%
  select(severity, hostname, source, module, process, pid)
Here's the resulting data frame:
  severity        hostname          source       module     process pid
1        4 computername125 PackageDownload herpderp.dll masterP.exe 234
2        5 computername126 PackageDownload herpderp.dll masterP.exe 235
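Because every offset is computed relative to where the field names are found, this solution handles values of different lengths. For example, it would read pid in correctly even if it's 95 (two digits instead of three). It does assume that the fields always appear in the same order, separated by "', ", and that no field name also occurs inside an earlier value.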
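If the field order might vary between entries, a regular expression per field is one alternative to the positional approach above. The following is only a minimal sketch in base R, not a drop-in replacement: extract_field is a hypothetical helper name introduced here, the second sample entry uses a made-up two-digit pid to show the variable-length case, and the pattern assumes that values never contain a single quote.

```r
# Sample input in the same shape as the log entries above
# (the second pid is shortened to two digits on purpose)
vec <- c(
  "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >",
  "<severity='5', hostname='computername126', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='95' >"
)

keys <- c("severity", "hostname", "source", "module", "process", "pid")

# Hypothetical helper: return whatever sits between key=' and the next
# single quote, or NA when the key is missing from an entry.
# Assumption: values never contain a single quote.
extract_field <- function(x, key) {
  pattern <- paste0("\\b", key, "='([^']*)'")
  matches <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Element 1 is the full match, element 2 is the captured value
  vapply(matches, function(m) m[2], character(1))
}

# One column per key, one row per log entry
parsed <- as.data.frame(
  sapply(keys, function(k) extract_field(vec, k), simplify = FALSE),
  stringsAsFactors = FALSE
)

print(parsed)
```

All columns come back as character, so numeric fields such as severity and pid would still need as.integer() if you want to do arithmetic on them.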