How to read multiple layers with st_read when there are duplicate layer names


I have a KML file, the unpacked version of a KMZ download (the URL is in the code further down). It has several thousand layers, each marked with the <Folder> XML tag, and many of them share the same layer name.

I'd like to load this into R using sf::st_read. The catch is that st_read reads one layer at a time and requires a layer name. I would be happy to iterate through the layer names fetched with st_layers() if they were unique, but they are not.

Is there an alternative way to specify the desired layer or, perhaps, a way to batch rename all the layers with a unique Id?

Thanks.

Adding some color based on the accepted answer below: initially, I tried to use read_xml to edit the <name> nodes, but they were not found.

I downloaded the KMZ file, loaded it into Google Earth, then saved it back out as a KML file ("Reports.kml"). This was my first mistake. The resulting KML is tab-delimited which confuses read_xml. It is valid XML but the tags are not recognized properly by read_xml even though the st_ functions work. Better to use unzip on the KMZ file. Here is what happens with the saved-with-Google-Earth version:

library(sf)
library(xml2)
library(tidyverse)

layers <- st_layers("reports.kml")

tibble(name = layers$name, type = flatten_chr(layers$geomtype)) %>%
  count(name, type, sort = TRUE)
# A tibble: 1,358 x 3
#            name  type     n
#           <chr> <chr> <int>
# 1     July 2006          25
# 2  October 2006          25
# 3   August 2008          20
# 4     July 2009          19
# 5   August 2005          18
# 6   August 2007          18
# 7 November 2006          18
# 8  October 2004          17
# 9   August 2000          16
#10 November 2012          16
# ... with 1,348 more rows

kml <- read_xml("reports.kml")

xml_find_all(kml, ".//Folder/name")
# {xml_nodeset (0)}

Nothing! But there is something there:

xml_children(kml)
# {xml_nodeset (1)}
# [1] <Folder>\n  <name>Reports</name>\n  <open>1</open>\n  <Folder>\n    
# <name>Class A</name>\n  ...
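
A possible explanation for the empty result, not confirmed anywhere in this thread: KML written by Google Earth usually declares a default XML namespace on the root element, and unprefixed XPath names in xml2 do not match namespaced elements. The sketch below uses a made-up inline document (not the real file) to show the effect and two ways around it: the d1 prefix that xml2 assigns to a default namespace, and xml_ns_strip().

```r
library(xml2)

# Hypothetical miniature KML with a default namespace on the root element
doc <- read_xml(
  '<kml xmlns="http://www.opengis.net/kml/2.2">
     <Folder><name>Reports</name>
       <Folder><name>Class A</name></Folder>
     </Folder>
   </kml>'
)

# Unprefixed XPath finds nothing while the namespace is in place
plain <- xml_find_all(doc, ".//Folder/name")
length(plain)

# Option 1: address the default namespace through the d1 prefix
prefixed <- xml_find_all(doc, ".//d1:Folder/d1:name", xml_ns(doc))
length(prefixed)

# Option 2: strip the default namespace (modifies doc in place)
xml_ns_strip(doc)
stripped <- xml_find_all(doc, ".//Folder/name")
xml_text(stripped)
```

Whether a namespace-stripped file still reads cleanly with st_read is not something this sketch checks, so unzipping the KMZ as shown next remains the safer route.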

Here is what happens with the unzipped KMZ:

download.file(url = "http://www.bfro.net/app/AllReportsKMZ.aspx",
              destfile = "AllBFROReports.kmz",
              mode = "wb")
unzip("AllBFROReports.kmz", junkpaths = TRUE)  # creates "doc.kml"


layers <- st_layers("doc.kml")

tibble(name = layers$name, type = flatten_chr(layers$geomtype)) %>%
  count(name, type, sort = TRUE)
# A tibble: 1,376 x 3
#            name  type     n
#           <chr> <chr> <int>
# 1     July 2006          25
# 2  October 2006          25
# 3   August 2008          20
# 4     July 2009          19
# 5   August 2005          18
# 6   August 2007          18
# 7 November 2006          18
# 8  October 2004          17
# 9   August 2000          16
#10 November 2012          16
# ... with 1,366 more rows

The st_layers output looks essentially the same, but now the nodes are properly found!

kml <- read_xml("doc.kml")
xml_find_all(kml, ".//Folder/name")
# {xml_nodeset (3874)}
#  [1] <name>June 2000</name>
#  [2] <name> 1995</name>
#  [3] <name>February 2004</name>
#  [4] <name>June 2004</name>
#  [5] <name>February 2004</name>
#  [6] <name>April 2008</name>
#  [7] <name>July 2009</name>
#  [8] <name>September 1981 and 1982</name>
#  [9] <name>July 1999</name>
# [10] <name>November 1983</name>
# [11] <name>October 2000</name>
# [12] <name>August 1993</name>
# [13] <name> 79, 80, 99</name>
# [14] <name> 1978</name>
# [15] <name>November 1980</name>
# [16] <name>January 1997</name>
# [17] <name> 1990</name>
# [18] <name>December 1996</name>
# [19] <name> 2000</name>
# [20] <name> 2001</name>
# ...

Now the answer provided below works like a charm!


Solution

  • A bit of XML surgery will do the trick.

    First, show the problem:

    library(sf)
    library(xml2)
    library(tidyverse)
    
    layers <- st_layers("AllBFROReports.kml")
    
    tibble(name = layers$name, type = flatten_chr(layers$geomtype)) %>%
      count(name, type, sort = TRUE)
    ## # A tibble: 1,376 x 3
    ##             name  type     n
    ##            <chr> <chr> <int>
    ##  1     July 2006          25
    ##  2  October 2006          25
    ##  3   August 2008          20
    ##  4     July 2009          19
    ##  5   August 2005          18
    ##  6   August 2007          18
    ##  7 November 2006          18
    ##  8  October 2004          17
    ##  9   August 2000          16
    ## 10 November 2012          16
    ## # ... with 1,366 more rows
    

    Ugh. A very mean person made that file.

    Now, read it in "raw":

    kml <- read_xml("AllBFROReports.kml")
    

    Add a sequential index number to each layer name:

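    # The counter lives outside the lambda, so it is updated with <<-
    idx <- 0
    xml_find_all(kml, ".//Folder/name") %>%
      walk(~{
        idx <<- idx + 1
        # xml2 nodes are modified in place, so kml itself changes
        xml_text(.x) <- sprintf("%s-%s", idx, xml_text(.x))
      })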
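
    The same renaming can also be written without the global counter, since xml2 accepts a character vector when assigning text to a whole nodeset. This is a minimal sketch on a made-up inline document (kml_demo is a hypothetical stand-in for the real file), not a replacement for the code above:

    ```r
    library(xml2)

    # Hypothetical stand-in for the real KML: three folders, two sharing a name
    kml_demo <- read_xml(
      "<Document>
         <Folder><name>July 2006</name></Folder>
         <Folder><name>July 2006</name></Folder>
         <Folder><name>August 2008</name></Folder>
       </Document>"
    )

    name_nodes <- xml_find_all(kml_demo, ".//Folder/name")

    # Prefix each name with its position; the assignment updates kml_demo in place
    xml_text(name_nodes) <- sprintf("%s-%s", seq_along(name_nodes), xml_text(name_nodes))

    # Re-query the document to confirm the names are now unique
    new_names <- xml_text(xml_find_all(kml_demo, ".//Folder/name"))
    ```

    seq_along() plays the role of idx here, so the numbering is the same as in the walk() version.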
    

    Make a new file:

    write_xml(kml, "AllBFROReports-unique.kml")
    

    Prove it worked:

    layers2 <- st_layers("AllBFROReports-unique.kml")
    
    tibble(name = layers2$name, type = flatten_chr(layers2$geomtype)) %>%
      count(name, type, sort = TRUE)
    ## # A tibble: 3,874 x 3
    ##                  name     type     n
    ##                 <chr>    <chr> <int>
    ##  1        1-June 2000              1
    ##  2   10-November 1983              1
    ##  3 100-September 1992              1
    ##  4  1000-October 1987              1
    ##  5  1001-October 1987              1
    ##  6  1002-October 1979              1
    ##  7     1003-June 1993 3D Point     1
    ##  8         1004- 1982 3D Point     1
    ##  9         1005- 1982 3D Point     1
    ## 10   1006-August 1977 3D Point     1
    ## # ... with 3,864 more rows
    

    Read one layer in using its new indexed name:

    st_read("AllBFROReports-unique.kml", layer = "10-November 1983")
    ## Reading layer `10-November 1983' from data source `/Users/bob/Desktop/AllBFROReports-unique.kml' using driver `KML'
    ## Simple feature collection with 2 features and 2 fields
    ## geometry type:  GEOMETRY
    ## dimension:      XYZ
    ## bbox:           xmin: -86.4677 ymin: 34.9484 xmax: -86.4441 ymax: 34.9637
    ## epsg (SRID):    4326
    ## proj4string:    +proj=longlat +datum=WGS84 +no_defs
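
    With unique names in place, iterating over every layer (what the question originally wanted) becomes straightforward. The helper below, read_all_layers, is a hypothetical name and a rough sketch rather than part of the original answer. It returns a named list instead of one combined table, because the layers may differ in geometry type.

    ```r
    library(sf)

    # Hypothetical helper: read every layer of a file into a named list of
    # sf objects, assuming the layer names are already unique
    read_all_layers <- function(path) {
      layer_names <- st_layers(path)$name
      stopifnot(anyDuplicated(layer_names) == 0)
      layers <- lapply(layer_names, function(nm) {
        st_read(path, layer = nm, quiet = TRUE)
      })
      setNames(layers, layer_names)
    }

    # Usage, assuming the renamed file from the steps above exists:
    # reports <- read_all_layers("AllBFROReports-unique.kml")
    # reports[["10-November 1983"]]
    ```

    With several thousand layers this makes one st_read call per layer, so expect it to take a while.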