Parsing XML file with duplicate tags


I currently use an XML parser to extract the name of a route from a GPX (XML) file.

Each GPX file contains a single "name" tag, which is what I've been extracting.

Here's the script:

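#!/bin/bash

gpxpath=/mnt/gpxfiles; export gpxpath

# Quote the directory so paths containing spaces survive word splitting.
for file in "$gpxpath"/*
do

# $file already holds the full path, so there is no need to call ls.
filename=$file; export filename
gpxname=$("$scripts"/xmlparse.pl "$file")

echo "$filename" "    $gpxname" >> gpxparse.tmp

done

sort -k 2,2 gpxparse.tmp > gpxparse.out

cat gpxparse.out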

And here's xmlparse.pl:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

XML::Twig->new(
    twig_handlers => {
        'name' => sub { print $_ ->text }
    }
    )->parse( <> );

Here's an example GPX file:

<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" creator="creator" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <metadata>     
        <referrer>Referrer</referrer>
        <time>2019-06-17T06:02:23.000Z</time>
    </metadata>
    <trk>
        <name>Another GPX file</name>
        <trkseg>
            <trkpt lon="-1.91990" lat="53.00131">
                <ele>112.1</ele>
                <time>2019-06-17T06:02:23.000Z</time>
            </trkpt>
            <trkpt lon="-1.91966" lat="53.00126">
                <ele>113.6</ele>
                <time>2019-06-17T06:02:25.000Z</time>
            </trkpt>
            <trkpt lon="-1.91962" lat="53.00125">
                <ele>114.1</ele>
                <time>2019-06-17T06:02:25.000Z</time>
            </trkpt>
            <trkpt lon="-1.91945" lat="53.00120">
                <ele>115.5</ele>
                <time>2019-06-17T06:02:26.000Z</time>
            </trkpt>
        </trkseg>
    </trk>
</gpx>

I can successfully extract the name of the route using the scripts above. However, I'd additionally like to extract the first co-ordinate pair in each file.

A track is defined by a "trk" element, and a track can contain multiple segments ("trkseg"). Finally, each trkseg contains multiple "trkpt" (track point) elements.

A track point usually consists of a latitude and longitude co-ordinate pair along with elevation and timestamp information.

I'm only looking to extract the lat and lon of the first trkpt in the GPX file. Ideally, once the script has found the first co-ordinate pair, it should exit and move on to the next file.

I've tried adding an additional Perl parse script using XML::Twig, but it seems to stumble when there are multiple elements with duplicate names.


Solution

  • Using xmlstarlet to extract the "name" value and the lat and lon of the first trkpt:

    xmlstarlet sel -t -v '//_:name'          -o , \
                      -v '//_:trkpt[1]/@lat' -o , \
                      -v '//_:trkpt[1]/@lon' -n \
                      file.xml
    
    Another GPX file,53.00131,-1.91990
    

    In the shell script, you can parse this output with:

    IFS=, read -r gpxname lat long < <( xmlstarlet ... )
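As a rough sketch of how this could be folded into the loop from the question, the snippet below wraps the two steps in helpers. The function names `gpx_summary` and `parse_summary` are hypothetical, and two details are variations of mine rather than part of the answer above. First, the XPath is written as `(//_:trkpt)[1]`, which selects the first trkpt in the whole document; `//_:trkpt[1]` matches the first trkpt of every trkseg and would print several values for a multi-segment track. Second, the name is emitted last, so a comma inside a route name stays in the name instead of shifting the co-ordinates. The sample line is made up for illustration.

```shell
#!/bin/bash

# Hypothetical helper: prints "lat,lon,name" for one GPX file.
# Assumes xmlstarlet is installed; it is only defined here, not called.
gpx_summary() {
    xmlstarlet sel -t -v '(//_:trkpt)[1]/@lat' -o , \
                      -v '(//_:trkpt)[1]/@lon' -o , \
                      -v '(//_:trk/_:name)[1]' -n "$1"
}

# Hypothetical helper: splits one summary line into lat, lon and gpxname.
# The last variable given to read receives the rest of the line,
# including any further commas.
parse_summary() {
    IFS=, read -r lat lon gpxname <<< "$1"
}

# Illustrative line in the layout gpx_summary prints (not real output).
parse_summary '53.00131,-1.91990,Another GPX file, part 2'
printf '%s\t%s\t%s\n' "$gpxname" "$lat" "$lon"
```

Inside the for loop, the call would then look like `parse_summary "$(gpx_summary "$file")"`, followed by the existing echo line with `$lat` and `$lon` appended.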