Split a giant kml file


I have a giant kml file with the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Style id="transBluePoly">
      <LineStyle>
        <width>1.5</width>
      </LineStyle>
      <PolyStyle>
        <color>30ffa911</color>
      </PolyStyle>
    </Style>
    <Style id="labelStyle">
       <IconStyle>
          <color>ffffa911</color>
          <scale>0.35</scale>
       </IconStyle>
       <LabelStyle>
         <color>ffffffff</color>
         <scale>0.35</scale>
      </LabelStyle>
    </Style>
    <Placemark>
      <name>9840229084|2013-03-06 13:41:34.0|rent|0.0|2|0|0|1|T|5990F529FB98F28A1F17D182152201A4|0|null|null|null|null|null|null|null|null|null|null|F|F|0|NO_POSTCODE</name>
      <styleUrl>#transBluePoly</styleUrl>
      <Polygon>
        <outerBoundaryIs>
          <LinearRing>
            <coordinates>
            -1.5191200,53.4086600
            -1.5214300,53.4011900
            -1.5303600,53.4028800
            -1.5435800,53.4033900
            -1.5404900,53.4083600
            -1.5191200,53.4086600
            </coordinates>
          </LinearRing>
        </outerBoundaryIs>
      </Polygon>
    </Placemark>
    <Placemark>
      <name>9840031669|2013-03-06 13:14:22.0|rent|0.0|0|0|0|1|F|E5BAC836984F53F91D7F60F247920F0C|0|null|null|null|null|null|null|null|null|null|null|F|F|3641161|DE4 3JT</name>
      <styleUrl>#transBluePoly</styleUrl>
      <Polygon>
        <outerBoundaryIs>
          <LinearRing>
            <coordinates>
            -1.2370933,53.1227587
            -1.2304837,53.1690463
            -1.1783129,53.2226956
            -1.2016444,53.2833233
            -1.3213687,53.3248921
            -1.4809916,53.3039582
            -1.6167192,53.2438689
            -1.5593782,53.1336370
            -1.4296123,53.0962399
            -1.3205129,53.1024090
            -1.2370933,53.1227587
            </coordinates>
          </LinearRing>
        </outerBoundaryIs>
      </Polygon>
    </Placemark>

I need to extract 1 million polygons from this to make it manageable. (I know a geo database is the ultimate solution; I'm looking for a quick fix.)

Loading it into a lightweight text editor and deleting some lines would be my first port of call, but I suspect this would take forever and a day (the file is 10 GB and I have 16 GB of RAM). Is there a more intelligent solution from a Linux terminal that avoids reading it all into RAM? I've seen Perl and bash commands for doing this, but I can't see how they would work for taking a random (or first million) sample: http://www.unix.com/shell-programming-scripting/159470-filter-kml-file-xml-remove-unwanted-entries.html


Solution

  • You can use a KML parsing library and a few lines of code to pull what you need out of a large KML or KMZ file.

    Java

    The GIScore Java library, for example, uses StAX to parse the KML source file one feature at a time, so it does not need to load the entire file into memory. The library is fast, so a 10 GB file shouldn't take very long.

    Here's a simple Java program that extracts the points from polygons inside a KML file. It works regardless of how large the KML file is or how deeply the Placemarks are nested within folders.

    import org.opensextant.geodesy.Geodetic2DPoint;
    import org.opensextant.giscore.events.*;
    import org.opensextant.giscore.geometry.*;
    import org.opensextant.giscore.input.kml.KmlInputStream;
    
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.text.DecimalFormat;
    
    public class Test {
    
      public static void main(String[] args) throws IOException {
        KmlInputStream kis = new KmlInputStream(new FileInputStream("test.kml"));
        IGISObject obj;
        DecimalFormat df = new DecimalFormat("0.0#####");
        while((obj = kis.read()) != null) {
          if (obj instanceof Feature) {
            Feature f = (Feature)obj;
            Geometry g = f.getGeometry();
            if (g instanceof Polygon) {
              System.out.println("Points");
              for(Point p : ((Polygon)g).getOuterRing().getPoints()) {
                // do something with the points (e.g. insert in database, etc.)
                Geodetic2DPoint pt = p.asGeodetic2DPoint();
                System.out.printf("%s,%s%n",
                        df.format(pt.getLatitudeAsDegrees()),
                        df.format(pt.getLongitudeAsDegrees()));
              }
            }
          }
        }
        kis.close();
      }
    }
    

    To run it, create the source file Test.java in the directory src/main/java and copy the code above into it.

    If the Geometry is a MultiGeometry then you'd need to add a check for that and iterate over the sub-geometries.

    Using Gradle, here's a sample build.gradle script to run the above test program using the command: gradle run

    apply plugin: 'java'
    
    repositories {
        mavenCentral()
    }
    
    task run (dependsOn: 'compileJava', type: JavaExec) {
        main = 'Test'
        classpath = sourceSets.main.runtimeClasspath
    }
    
    dependencies {
        compile 'org.opensextant:geodesy:2.0.1'
        compile 'org.opensextant:giscore:2.0.1'
    }
    

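    This requires that you install both Gradle and the Java Development Kit (JDK). Note that the compile configuration was removed in Gradle 7; on newer versions, use implementation in the dependencies block instead.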

    Python

    Alternatively, you can parse KML in Python with the pykml library. From there you can create multiple smaller KML files with some logic to split the polygons, insert the polygon geometries into a PostgreSQL database, and so on. pykml questions are supported on Stack Overflow under the library's tag.

    from pykml import parser
    import re
    
    with open('data.kml', 'r') as fp:
        doc = parser.parse(fp)
    
    for pm in doc.getroot().Document.Placemark:
        print(pm.name)
        # Get the coordinates from either polygon or polygon inside multigeometry
        if hasattr(pm, 'MultiGeometry'):
            pm = pm.MultiGeometry
        if hasattr(pm, 'Polygon'):
            pcoords = pm.Polygon.outerBoundaryIs.LinearRing.coordinates.text
            # extract coords into a list of lon,lat coord pairs
            coords = re.split(r'\s+', pcoords.strip())
            for coord in coords:
                lonlat = coord.split(',')
                if len(lonlat) > 1:
                    print(lonlat[0], lonlat[1])
                    # add logic here - insert points into DB, etc.
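
    One caveat: parser.parse builds the whole document tree in memory, which may not be practical for a 10 GB file. As a rough sketch of a streaming alternative, the standard library's xml.etree.ElementTree.iterparse can copy the styles and the first N Placemarks into a smaller file without holding the whole input in RAM. This assumes the layout shown in the question (Style elements directly under Document, everything in the KML 2.2 namespace). The function name extract_placemarks and the file names are made up for illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
DOCUMENT = "{%s}Document" % KML_NS
PLACEMARK = "{%s}Placemark" % KML_NS
STYLE = "{%s}Style" % KML_NS


def extract_placemarks(src, dst, limit):
    """Copy top-level styles and the first `limit` Placemarks from src to dst.

    Hypothetical helper: assumes a KML 2.2 file laid out like the question's.
    Returns the number of Placemarks written.
    """
    # Serialize KML elements without an "ns0:" prefix.
    ET.register_namespace("", KML_NS)
    written = 0
    stack = []  # currently open elements, so a finished child can be detached
    with open(src, "rb") as fin, open(dst, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<kml xmlns="%s">\n<Document>\n' % KML_NS)
        for event, elem in ET.iterparse(fin, events=("start", "end")):
            if event == "start":
                stack.append(elem)
                continue
            stack.pop()  # elem has just been closed
            parent = stack[-1] if stack else None
            is_placemark = elem.tag == PLACEMARK
            is_top_style = (elem.tag == STYLE and parent is not None
                            and parent.tag == DOCUMENT)
            if not (is_placemark or is_top_style):
                continue
            out.write(ET.tostring(elem, encoding="unicode"))
            out.write("\n")
            if parent is not None:
                parent.remove(elem)  # free the element so memory stays flat
            if is_placemark:
                written += 1
                if written >= limit:
                    break  # stop reading; the rest of the file is never parsed
        out.write("</Document>\n</kml>\n")
    return written
```

    Calling extract_placemarks("giant.kml", "subset.kml", 1000000) would then write the first million Placemarks to subset.kml. Each copied element repeats the xmlns declaration, which is redundant but still valid XML. A random sample could be built on the same loop (for example with reservoir sampling), at the cost of reading the whole file once.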