ruby xml timeout kml large-data

Parsing huge (~100mb) kml (xml) file taking *hours* without any sign of actual parsing


I'm currently trying to parse a very large kml (xml) file with ruby (Nokogiri) and am having a little bit of trouble.

The parsing code is good, in fact I'll share it just for the heck of it, even though this code doesn't have much to do with my problem:

geofactory = RGeo::Geographic.projected_factory(:projection_proj4 => "+proj=lcc +lat_1=34.83333333333334 +lat_2=32.5 +lat_0=31.83333333333333 +lon_0=-81 +x_0=609600 +y_0=0 +ellps=GRS80 +to_meter=0.3048 +no_defs", :projection_srid => 3361)
f = File.open("horry_parcels.kml")
kmldoc = Nokogiri::XML(f)

kmldoc.css("//Placemark").each_with_index do |placemark, i|
  puts i
  tds = Nokogiri::HTML(placemark.search("//description").children[0].to_html).search("tr > td")
  h = HorryParcel.new
  h.owner_name = tds.shift.text
  tds.shift
  tds.each_slice(2) do |k, v|
    col = k.text.downcase
    # Call the setter dynamically instead of eval-ing text read from the file
    h.public_send("#{col}=", v.text)
  end
  coords = kmldoc.search("//MultiGeometry")[i].text.gsub("\n", "").gsub("\t", "").split(",0 ").map { |x| x.split(",") }
  points = coords.map { |lon, lat| geofactory.parse_wkt("POINT (#{lon} #{lat})") }
  geo_shape = geofactory.polygon(geofactory.linear_ring(points))
  proj_shape = geo_shape.projection
  h.geo_shape = geo_shape
  h.proj_shape = proj_shape
  h.save
end

Anyway, I've tested this code with a much, much smaller sample of kml and it works.

However, when I load the real thing, ruby simply waits, as if it is processing something. This "processing", however, has now spanned several hours while I've been doing other things. As you might have noticed, I have a counter (each_with_index) on the array of Placemarks and during this multi-hour period, not a single i value has been put to the command line. Oddly enough it hasn't timed out yet, but even if this works there has got to be a better way to do this thing.

I know I could open up the KML file in Google Earth (Google Earth Pro here) and save the data in smaller, more manageable kml files, but the way things appear to be set up, this would be a very manual, unprofessional process.

Here's a sample of the kml (w/ just one placemark) if that helps.

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document>
    <name>justone.kml</name>
    <Style id="PolyStyle00">
        <LabelStyle>
            <color>00000000</color>
            <scale>0</scale>
        </LabelStyle>
        <LineStyle>
            <color>ff0000ff</color>
        </LineStyle>
        <PolyStyle>
            <color>00f0f0f0</color>
        </PolyStyle>
    </Style>
    <Folder>
        <name>justone</name>
        <open>1</open>
        <Placemark id="ID_010161">
            <name>STUART CHARLES A JR</name>
            <Snippet maxLines="0"></Snippet>
            <description>""</description>
            <styleUrl>#PolyStyle00</styleUrl>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>
                                -78.941896,33.867893,0     -78.942514,33.868632,0 -78.94342899999999,33.869705,0 -78.943708,33.870083,0 -78.94466799999999,33.871142,0 -78.94511900000001,33.871639,0 -78.94541099999999,33.871776,0 -78.94635,33.872216,0 -78.94637899999999,33.872229,0 -78.94691400000001,33.87248,0 -78.94708300000001,33.87256,0 -78.94783700000001,33.872918,0 -78.947889,33.872942,0 -78.948655,33.873309,0 -78.949589,33.873756,0 -78.950164,33.87403,0 -78.9507,33.873432,0 -78.95077000000001,33.873384,0 -78.950867,33.873354,0 -78.95093199999999,33.873334,0 -78.952518,33.871631,0 -78.95400600000001,33.869583,0 -78.955254,33.867865,0 -78.954606,33.867499,0 -78.953833,33.867172,0 -78.952994,33.866809,0 -78.95272799999999,33.867129,0 -78.952139,33.866803,0 -78.95152299999999,33.86645,0 -78.95134299999999,33.866649,0 -78.95116400000001,33.866847,0 -78.949281,33.867363,0 -78.948936,33.866599,0 -78.94721699999999,33.866927,0 -78.941896,33.867893,0 
                            </coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
    </Folder>
</Document>
</kml>

EDIT: 99.9% of the data I work with is in *.shp format, so I've just ignored this problem for the past week. But I'm going to get this process running on my desktop computer (off of my laptop) and run it until it either times out or finishes.

class ClassName
  attr_reader :before, :after

  def go
    @before = Time.now
    run_actual_code
    @after = Time.now
    # Subtracting two Time objects gives the elapsed seconds as a Float
    puts "process took #{@after - @before} seconds to complete"
  end

  def run_actual_code
    # ...
  end
end

The above code should tell me how long it took. From that (if it does actually finish) we should be able to compute a rough rule of thumb for how long you should expect your (otherwise PERFECT) code to run without SAX parsing or "atomization" of the document's text components.
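For what it's worth, the same measurement can be taken without a wrapper class. The following is only a sketch: `timed` is a hypothetical helper name, not something from the code above, and it reads the monotonic clock so the result is not skewed if the system time changes during a multi-hour run.

```ruby
# Hypothetical helper (assumption, not part of the original code):
# runs the block and returns [block_result, elapsed_seconds].
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

# Stand-in workload; in practice the block would wrap the parsing code.
total, seconds = timed { (1..1_000).sum }
puts "process took #{seconds.round(6)} seconds to complete (result: #{total})"
```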


Solution

  • For a huge XML file, you should not use Nokogiri's default XML parser, because it builds a DOM: the whole document is loaded into memory as a tree before your code sees a single node. A much better strategy for large XML files is SAX, and luckily Nokogiri supports it.

    The downside is that with a SAX parser all of your logic has to be written as callbacks. The idea is simple: the SAX parser reads through the file and notifies you whenever it finds something interesting, such as an opening tag, a closing tag, or a run of text. You bind callbacks to these events and extract whatever you need.

    Of course, you don't want to use a SAX parser to load the whole file into memory and work with it there; that is exactly what SAX is meant to avoid. Instead, process the file piece by piece as the events arrive.

    So this is basically a rewrite of your parsing logic using callbacks. To learn more about DOM vs. SAX XML parsers, you might want to check this FAQ from cs.nmsu.edu.