Search code examples
mysqlbashperformanceimportgml-geographic-markup-lan

Improving efficiency of a bash script reading data from a large gml file


For various reasons, I have been tasked with importing data from a large gml file into a MySQL database. The gml files I am looking at are the ones at https://use-land-property-data.service.gov.uk/datasets/inspire/download.

Basically, all I am interested in is the coordinates of each object (polygon). I can import the coordinates, as they appear in the gml file (lat1 long1 lat2 long2...). I also need to include the maximum and minimum latitude and longitude as four separate columns (this needs to be calculated; it is not included in the file).

My attempt so far is as follows:

#!/bin/bash

gmlfile=$1
gmldir="/home/user.name/gml_files"

gml_short=$(echo $gmlfile | sed 's/\.gml//')

echo "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES" > gml_sql_query.sql

xmllint --xpath "//*[local-name()='posList']" "$gmldir/$gmlfile" | cut -d">" -f2 | cut -d"<" -f1 > "$gmldir/$gml_short-coords.txt" #Dump all coordinates to coordinate file

while read coordinates; do

    latitude=$(echo $coordinates | tr ' ' '\n' | paste -d' ' - - | cut -d" " -f1) #odd elements
    longitude=$(echo $coordinates | tr ' ' '\n' | paste -d' ' - - | cut -d" " -f2) #even elements

    max_lat=$(echo "$latitude" | tr " " '\n' | sort -n | tail -1)
    min_lat=$(echo "$latitude" | tr " " '\n' | sort -n | head -1)
    max_long=$(echo "$longitude" | tr " " '\n' | sort -n | tail -1)
    min_long=$(echo "$longitude" | tr " " '\n' | sort -n | head -1)


    echo "('$gml_short','$coordinates','$max_lat','$max_long','$min_lat','$min_long')," >> gml_sql_query.sql

done < "$gmldir/$gml_short-coords.txt"

sed -i '$s/,$/;/' gml_sql_query.sql #Replace last , with ;

mysql < gml_sql_query.sql

The xmllint line and mysql import itself are both fast. The population of the "gml_sql_query.sql" file is really, really slow; around 1000 every 30 seconds. Every one of those gml files has tens of 1000s of coordinates, so this is taking a long, long time. Which parts of my script are the most inefficient, and is there any way to speed this up?

Two asides:

  1. I attempted to have lat_long and lat_process (etc) variables that did most of the piping and just did a final pipe for the first/second column or whatever. I didn't find that had any improvement.
  2. I tried using arrays, but since the values are decimals this was also slow:
mapfile -t lat_array <<< "$latitude"
    max_lat=0
    min_lat=9999999.0
    for num in "${lat_array[@]}"; do
        if [[ $(echo "$num > $max_lat" | bc -l) -eq 1 ]]; then
            max_lat=$num
        fi

        if [[ $(echo "$num < $min_lat" | bc -l) -eq 1 ]]; then
            min_lat=$num
        fi
    done

    mapfile -t long_array <<< "$longitude"
    max_long=0
    min_long=9999999.0
    for num in "${long_array[@]}"; do
        if [[ $(echo "$num > $max_long" | bc -l) -eq 1 ]]; then
            max_long=$num
        fi

        if [[ $(echo "$num < $min_long" | bc -l) -eq 1 ]]; then
            min_long=$num
        fi
    done

EDIT: example input data:

525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339
525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9
525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65

Desired MySQL insert statement:

INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES
('Adur_District_Council','525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339','525875.301','104966.97','525773.63','104945.234'),
('Adur_District_Council','525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9','525685.085','105125.2','525661.29','105117.9'),
('Adur_District_Council','525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65','525709.949','105128.05','525684.406','105118.65');

Solution

  • Using any awk:

    $ cat tst.sh
    #!/usr/bin/env bash
    
    gmlfile='Adur_District_Council.gml'
    gml_short="${gmlfile%%.gml}"
    
    awk -v q="'" -v gml_short="$gml_short" '
        BEGIN {
            OFS = q "," q
            print "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES"
            ORS = ""
        }
        {
            max_lat = min_lat = $1
            max_lon = min_lon = $2
            for ( i=3; i<NF; i+=2 ) {
                lat = $i
                lon = $(i+1)
                max_lat = (lat > max_lat ? lat : max_lat)
                min_lat = (lat < min_lat ? lat : min_lat)
                max_lon = (lon > max_lon ? lon : max_lon)
                min_lon = (lon < min_lon ? lon : min_lon)
            }
            print sep "(" q gml_short, $0, max_lat, max_lon, min_lat, min_lon q ")"
            sep = ",\n"
        }
        END {
            print ";\n"
        }
    ' coords.txt
    

    $ cat coords.txt
    525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339
    525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9
    525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65
    

    $ ./tst.sh
    INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES
    ('Adur_District_Council','525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339','525875.301','104966.97','525773.63','104945.234'),
    ('Adur_District_Council','525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9','525685.085','105125.2','525661.29','105117.9'),
    ('Adur_District_Council','525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65','525709.949','105128.05','525684.406','105118.65');
    

    So your full script would become (untested):

    #!/usr/bin/env bash
    
    gmlfile="$1"
    gmldir='/home/user.name/gml_files'
    gml_short="${gmlfile%%.gml}"
    
    xmllint --xpath "//*[local-name()='posList']" "$gmldir/$gmlfile" | cut -d">" -f2 | cut -d"<" -f1 > "$gmldir/$gml_short-coords.txt" #Dump all coordinates to coordinate file
    
    awk -v q="'" -v gml_short="$gml_short" '
        BEGIN {
            OFS = q "," q
            print "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES"
            ORS = ""
        }
        {
            max_lat = min_lat = $1
            max_lon = min_lon = $2
            for ( i=3; i<NF; i+=2 ) {
                lat = $i
                lon = $(i+1)
                max_lat = (lat > max_lat ? lat : max_lat)
                min_lat = (lat < min_lat ? lat : min_lat)
                max_lon = (lon > max_lon ? lon : max_lon)
                min_lon = (lon < min_lon ? lon : min_lon)
            }
            print sep "(" q gml_short, $0, max_lat, max_lon, min_lat, min_lon q ")"
            sep = ",\n"
        }
        END {
            print ";\n"
        }
    ' "$gmldir/$gml_short-coords.txt" > gml_sql_query.sql
    
    mysql < gml_sql_query.sql
    

    or if you'd rather not have that apparently temporary gml_sql_query.sql file:

    $ cat tst.sh
    #!/usr/bin/env bash
    
    gmlfile="$1"
    gmldir='/home/user.name/gml_files'
    gml_short="${gmlfile%%.gml}"
    
    xmllint --xpath "//*[local-name()='posList']" "$gmldir/$gmlfile" | cut -d">" -f2 | cut -d"<" -f1 > "$gmldir/$gml_short-coords.txt" #Dump all coordinates to coordinate file
    
    mysql < <(
        awk -v q="'" -v gml_short="$gml_short" '
            BEGIN {
                OFS = q "," q
                print "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES"
                ORS = ""
            }
            {
                max_lat = min_lat = $1
                max_lon = min_lon = $2
                for ( i=3; i<NF; i+=2 ) {
                    lat = $i
                    lon = $(i+1)
                    max_lat = (lat > max_lat ? lat : max_lat)
                    min_lat = (lat < min_lat ? lat : min_lat)
                    max_lon = (lon > max_lon ? lon : max_lon)
                    min_lon = (lon < min_lon ? lon : min_lon)
                }
                print sep "(" q gml_short, $0, max_lat, max_lon, min_lat, min_lon q ")"
                sep = ",\n"
            }
            END {
                print ";\n"
            }
        ' "$gmldir/$gml_short-coords.txt"
    )