Search code examples
xpathsplunk

Using Splunk - extract fields from xml data in a log file using xpath


I am using Splunk to extract a number of fields from xml data this is contained in a log file. So to limit the search to be MOSTLY the xml file I start the search with this: sourcetype="name of type here" "RULE"

This returns:

0123459 TripMessage.createMessage MsgSource <?xml version="1.0" encoding="UTF-8"?>
<tmsTrip xmlns="<link to trip" xmlns:ns2="link to common">

...

The file is very large. This is part of it.

<?xml version="1.0" encoding="UTF-8"?>
<tmsTrip xmlns="trip" xmlns:ns2="Common">
   <recordType>PURCHASEDLINEHAUL</recordType>
   <eventType>APPROVE</eventType>
.... more lines here that are not important
     <purchasedCost>
      <purchasedCostTripSegment>
         <purchCostReference>1587040</purchCostReference>
         <carrier information>
         <origin>
            <contact information>
         </origin>

This query selects the xml part text in the logging file and some of the fields are extracted and I can add to a table. (not including the source and sourcetype..)

| xmlkv | table purchCostReference, eventType, carrier, billingMethod

But need more fields that are child elements within the xml data. One of them is the numberCode. I am trying to use xpath to extract these additional fields.

| xmlkv | xpath
"//tmsTrip/purchasedCost/purchasedCostTripSegment/origin/ns2:numberCode" outfield=Origin | table purchCostReference, eventType, carrier, billingMethod, Origin

But no Origin data is returned when I add the field to the table. There is no error. The Origin column is empty. enter image description here

UPDATE I think the problem is that I need to add the field parameter. The xml file is within a log text file. I limit the search to get the xml file but not only the xml. So I think xpath is struggling with the other text that is not xml.

UPDATE I tried creating an extracted field using the wizard of the xml file that is within the logging statement. The xml is huge and I can only select about 30% of it. If anyone is good at regex, maybe they can give me some pointers as to how to complete the regex command to get all of the xml. (I tried updating the props.conf file but do not have permission to add TRUNCATE = 0). This is the xml file sample:

<?xml version="1.0" encoding="UTF-8"?>
<tmsTrip xmlns="trip" xmlns:ns2="Common">
   <recordType>PURCHASEDLINEHAUL</recordType>
   <eventType>APPROVE</eventType>
   <tripId>143642990</tripId>
   <legId>129014817</legId>
   <tripNumber>129014817</tripNumber>
   <tripLegNumber>1</tripLegNumber>
   <updatedDateGMT>2020-05-22T00:53:21.000Z</updatedDateGMT>
   <origin>
     <contact info>
   </origin>
   <destination>
     <contact info>
   </destination>
   <schedDispatchDateGMT>2020-05-22T13:00:00.000Z</schedDispatchDateGMT>
   <estimatedArrivalDateGMT>2020-05-26T06:00:00.000Z</estimatedArrivalDateGMT>
   <drop/>
   <hook/>
   <actualRoute>
     <info>
   </actualRoute>
   <standardRoute>
    <info>
   </standardRoute>
   <paidRoute>
     <info>
   </paidRoute>
   <settlement>
      <contact info>
               <ns2:daylightSavingsFlag>true</ns2:daylightSavingsFlag>
            </ns2:stopOffLocation>
         </stopOff>
         <schedDispatchDate>2020-05-22T05:00:00.000Z</schedDispatchDate>
         <estimatedArrivalDate>2020-05-26T01:00:00.000Z</estimatedArrivalDate>
         <billingMethod>LOCAL</billingMethod>
         <STCCCode>4711110</STCCCode>
         <planNumber>045</planNumber>
         <powerType>1X</powerType>
         <powerOnlyFlag>false</powerOnlyFlag>
      </purchasedCostTripSegment>
   </purchasedCost>
   <drivers/>
</tmsTrip>

This is how much the extracted field I can select:

<tmsTrip xmlns="trip" xmlns:ns2="Common"> PURCHASEDLINEHAUL 2020-05-22T00:53:21.000Z

This is the regex that Splunk creates to select the above xml

^[^\$\n]*\$\d+\.\w+\s+\w+\s+(?P<xmlMessage><\?\w+\s+\w+="\d+\.\d+"\s+\w+="\w+\-\d+"\?>\s+<\w+\s+\w+="\w+://\w+\.\w+\.\w+/\w+/\w+/\w+"\s+\w+:\w+="\w+://\w+\.\w+\.\w+/\w+/\w+/\w+">\s+<\w+>\w+</\w+>\s+<\w+>\w+</\w+>\s+<\w+>\d+</\w+>\s+<\w+>\d+</\w+>\s+<\w+>\d+</\w+>\s+<\w+>\d+</\w+>\s+<\w+>\d+\-\d+\-\d+\w+:\d+:\d+\.\d+\w+</\w+>\s+<\w+>\s+<\w+:\w+>\d+</\w+:\w+>\s+<\w+:\w+>\w+</\w+:\w+>\s+<\w+:\w+>\d+\s+\w+\s+\w+\s+\w+\s+\w+</\w+:\w+>\s+<\w+:\w+>\w+</\w+:\w+>\s+<\w+:\w+>\w+</\w+:\w+>\s+<\w+:\w+>\d+</\w+:\w+>\s+<\w+:\w+>\w+</\w+:\w+>\s+<\w+:\w+>\d+</\w+:\w+>\s+<\w+:\w+>\w+</\w+:\w+>\s+<\w+:\w+>\w+</\w+:\w+>)

So can I change the above regex to include the entire xml?

UPDATE I tried extracting a field from the xmlMessage extracted field. The xmlMessage field is above. I used the xpath command to extract recordType. Put the result in a table. This is the command

| xmlkv | xpath field=xmlMessage
"//tmsTrip/recordType" outfield=Origin | table Origin

It returned no results. This xpath command does not work for the simplest of queries. What am I doing wrong?


Solution

  • I was able to extract the data from the xml using rex and identify each instance of the numberCode using max_match and mvindex Here is an example of anyone who has this problem:

     rex max_match=0 "\<ns2\:numberCode\>(?P<location>[^\<]+)"| eval Segment1_Origin =  mvindex(location, 7)
    

    The xml element is ns2:numberCode. It is renamed to location. max_match=0 means unlimited number of instances. The mvindex is zero based. So the 8th instance of the location is set to the variable Segment1_origin