Search code examples
rxml2

unequal vector in xml from different nodes


I am trying to extract data from https://prtr.defra.gov.uk/full-dataset and the file is called uk_prtr_dataset_2021.xml

library(xml2)
library(dplyr) 

xml_file <- read_xml('uk_prtr_dataset_2021.xml') 

I want to extract following nodes

node_vec <- c("ParentCompanyName", "FacilityName", "LongitudeMeasure", "LatitudeMeasure", "MainEconomicActivityName")

for(node in seq_along(node_vec)){

  node_ref <- node_vec[node]
  var_ref <- paste0(".//rsm:",node_ref)

  temp <- xml_file %>%   
           xml_find_all(var_ref) %>%
           as_list() %>%
           simplify() %>%
           enframe() %>%
           unnest(value) %>%
           unnest(value)

print(paste0(node_ref,": ",nrow(temp)))
}

[1] "ParentCompanyName: 6305"
[1] "FacilityName: 6306"
[1] "LongitudeMeasure: 6305"
[1] "LatitudeMeasure: 6305"
[1] "MainEconomicActivityName: 6305"

However, I see that FacilityName has one extra entry due to which I cannot bind the 5 columns together. Is there anyway I can figure out which is the extra entry so that I can remove it.

EDIT: A small part of XML

<rsm:PollutantReleaseAndTransferReport xmlns:rsm="urn:eu:com:env:prtr:data:standard:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:eu:com:env:prtr:data:standard:2 http://www.eionet.europa.eu/schemas/eprtr/PollutantReleaseAndTransferReport_2p0.xsd">rsm:ReportingYear2021</rsm:ReportingYear>rsm:CountryIDUK</rsm:CountryID>rsm:CoordinateSystemIDEPSG:4326</rsm:CoordinateSystemID>rsm:RemarkTextNone</rsm:RemarkText>rsm:CompetentAuthorityPartyrsm:NameBEIS</rsm:Name>rsm:Addressrsm:StreetName1 Victoria Street</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1H 0ET</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberText00 44 774 169 9372</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberText--</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:[email protected]</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameShamim Choudhury</rsm:ContactPersonName></rsm:CompetentAuthorityParty>rsm:CompetentAuthorityPartyrsm:NameE1</rsm:Name>rsm:Addressrsm:StreetNameAll correspondence to: Defra, Industrial Pollution, 5F Ergon House, Horseferry Rd. London.</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1P 2AL</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:[email protected]</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameNational Data Manager</rsm:ContactPersonName></rsm:CompetentAuthorityParty>rsm:CompetentAuthorityPartyrsm:NameE10</rsm:Name>rsm:Addressrsm:StreetNameAll correspondence to: Defra, Industrial Pollution, 5F Ergon House, Horseferry Rd. London.</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1P 2AL</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:[email protected]</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameNational Data Manager</rsm:ContactPersonName>


Solution

  • I'd approach this bit differently; this issue resolves on its own when we move from building column vectors to trimmed-down FacilityReport records (i.e. rows for future data.frame / tibble), if something is missing, we'll just have NA values in those specific rows instead of varying length column vectors.

    We can start by finding all rsm:FacilityReport nodes; to speed things up a bit, we'll remove all subnodes that are not needed. From there, we can turn reports node-set to a list and create a nested tibble, hoist nested values.

    library(xml2)
    library(dplyr) 
    library(tidyr)
    
    tictoc::tic("load and process")
    xml_file <- read_xml('uk_prtr_dataset_2021.xml') 
    
    # collect all reports
    reports <- xml_find_all(xml_file, "//rsm:FacilityReport")
    
    # collect and remove all other nodes so we could get a clean result from as_list(reports)
    to_remove <- xml_find_all(reports, "./*[not(self::rsm:ParentCompanyName or self::rsm:FacilityName or self::rsm:GeographicalCoordinate or self::rsm:MainEconomicActivityName)]")
    xml_remove(to_remove)
    
    reports_tbl <- 
      reports |> 
      as_list() |> 
      tibble(report = _) |>
      # extract nested values
      hoist(report, 
            parent = list("ParentCompanyName", 1),
            facility = list("FacilityName", 1),
            main_act = list("MainEconomicActivityName", 1),
            lon = list("GeographicalCoordinate","LongitudeMeasure", 1),
            lat = list("GeographicalCoordinate","LatitudeMeasure", 1),
            ) |>
      select(-report)
    tictoc::toc()
    #> load and process: 7.28 sec elapsed
    

    Resulting frame along with the record that caused issues:

    reports_tbl
    #> # A tibble: 6,306 × 5
    #>    parent                 facility                          main_act lon   lat  
    #>    <chr>                  <chr>                             <chr>    <chr> <chr>
    #>  1 IQE (Europe) Ltd       IQE (Europe) Limited, St Mellons… Manufac… -3.0… 51.5…
    #>  2 G B Davies & Co        G B Davies & Co, Plas Bach Farm   Raising… -3.1… 52.7…
    #>  3 Hook 2 Sisters Limited Hook 2 Sisters Limited, Merllyn … Raising… -3.2… 53.2…
    #>  4 Hook 2 Sisters Limited Hook 2 Sisters Limited, Morlais   Raising… -4.3… 53.1…
    #>  5 Mills Poultry Ltd.     Mills Poultry Ltd., Pentreucha F… Raising… -3.4… 52.6…
    #>  6 Mills Poultry Ltd.     Mills Poultry Ltd., Minffordd Fa… Raising… -3.4… 52.6…
    #>  7 Mr Philip Weale        P J Weale, Highfield Poultry Farm Raising… -4.0… 51.5…
    #>  8 Uniper UK Limited      Connah'S Quay Power Station       Product… -3.0… 53.2…
    #>  9 Glas Cymru Cyfyngedig  Nash Stw - Final Effluent         Sewerage -2.9… 51.5…
    #> 10 Glas Cymru Cyfyngedig  Five Fords Stw                    Sewerage -2.9… 53.0…
    #> # ℹ 6,296 more rows
    
    # offending FacilityReport:
    reports_tbl[!complete.cases(reports_tbl),]
    #> # A tibble: 1 × 5
    #>   parent facility                       main_act lon   lat  
    #>   <chr>  <chr>                          <chr>    <chr> <chr>
    #> 1 <NA>   Newly created but not approved <NA>     <NA>  <NA>
    

    Created on 2023-11-22 with reprex v2.0.2

    A rsm:FacilityReport node for reference:

        <rsm:FacilityReport>
            <rsm:NationalID>Wales_KP3235SS</rsm:NationalID>
            <rsm:PreviousNationalID>
                <rsm:NationalID>Wales_KP3235SS</rsm:NationalID>
                <rsm:ReportingYear>2020</rsm:ReportingYear>
            </rsm:PreviousNationalID>
            <rsm:ParentCompanyName>IQE (Europe) Ltd</rsm:ParentCompanyName>
            <rsm:FacilityName>IQE (Europe) Limited, St Mellons Semiconductor Plant</rsm:FacilityName>
            <rsm:Address>
                <rsm:StreetName>Cypress Drive, Cardiff</rsm:StreetName>
                <rsm:BuildingNumber>--</rsm:BuildingNumber>
                <rsm:CityName>Cardiff</rsm:CityName>
                <rsm:PostcodeCode>CF3 0LW</rsm:PostcodeCode>
            </rsm:Address>
            <rsm:GeographicalCoordinate>
                <rsm:LongitudeMeasure>-3.085770</rsm:LongitudeMeasure>
                <rsm:LatitudeMeasure>51.530800</rsm:LatitudeMeasure>
            </rsm:GeographicalCoordinate>
            <rsm:RiverBasinDistrictID>UK09</rsm:RiverBasinDistrictID>
            <rsm:NACEMainEconomicActivityCode>27.90</rsm:NACEMainEconomicActivityCode>
            <rsm:MainEconomicActivityName>Manufacture of other electrical equipment</rsm:MainEconomicActivityName>
            <rsm:CompetentAuthorityPartyName>NRW02</rsm:CompetentAuthorityPartyName>
            <rsm:NutsRegionID>UKL2</rsm:NutsRegionID>
            <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
            <rsm:ProtectVoluntaryData>0</rsm:ProtectVoluntaryData>
            <rsm:Activity>
                <rsm:RankingNumeric>1</rsm:RankingNumeric>
                <rsm:AnnexIActivityCode>4.(b).(v)</rsm:AnnexIActivityCode>
            </rsm:Activity>
            <rsm:PollutantRelease>
                <rsm:MediumCode>AIR</rsm:MediumCode>
                <rsm:PollutantCode>NH3</rsm:PollutantCode>
                <rsm:MethodBasisCode>C</rsm:MethodBasisCode>
                <rsm:MethodUsed>
                    <rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
                    <rsm:Designation>none specified</rsm:Designation>
                </rsm:MethodUsed>
                <rsm:TotalQuantity unitCode="KGM">1.98</rsm:TotalQuantity>
                <rsm:AccidentalQuantity unitCode="KGM">0.00</rsm:AccidentalQuantity>
                <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
                <rsm:RemarkText>significant drop 2020 v 2021 is due to vastly decreased ammonia emissions to air.</rsm:RemarkText>
            </rsm:PollutantRelease>
            <rsm:PollutantRelease>
                <rsm:MediumCode>AIR</rsm:MediumCode>
                <rsm:PollutantCode>AS AND COMPOUNDS</rsm:PollutantCode>
                <rsm:MethodBasisCode>C</rsm:MethodBasisCode>
                <rsm:MethodUsed>
                    <rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
                    <rsm:Designation>none specified</rsm:Designation>
                </rsm:MethodUsed>
                <rsm:TotalQuantity unitCode="KGM">6.56</rsm:TotalQuantity>
                <rsm:AccidentalQuantity unitCode="KGM">0.00</rsm:AccidentalQuantity>
                <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
            </rsm:PollutantRelease>
            <rsm:WasteTransfer>
                <rsm:WasteTypeCode>NON-HW</rsm:WasteTypeCode>
                <rsm:WasteTreatmentCode>D</rsm:WasteTreatmentCode>
                <rsm:Quantity unitCode="TNE">2.60</rsm:Quantity>
                <rsm:MethodBasisCode>C</rsm:MethodBasisCode>
                <rsm:MethodUsed>
                    <rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
                    <rsm:Designation>none specified</rsm:Designation>
                </rsm:MethodUsed>
                <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
                <rsm:RemarkText>2021 non-hazardous waste to landfill.</rsm:RemarkText>
            </rsm:WasteTransfer>
            <rsm:WasteTransfer>
                <rsm:WasteTypeCode>NON-HW</rsm:WasteTypeCode>
                <rsm:WasteTreatmentCode>R</rsm:WasteTreatmentCode>
                <rsm:Quantity unitCode="TNE">7.56</rsm:Quantity>
                <rsm:MethodBasisCode>C</rsm:MethodBasisCode>
                <rsm:MethodUsed>
                    <rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
                    <rsm:Designation>none specified</rsm:Designation>
                </rsm:MethodUsed>
                <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
                <rsm:RemarkText>waste to energy - 5107kg&#13;
    recycling - 2576kg</rsm:RemarkText>
            </rsm:WasteTransfer>
            <rsm:WasteTransfer>
                <rsm:WasteTypeCode>HWIC</rsm:WasteTypeCode>
                <rsm:WasteTreatmentCode>D</rsm:WasteTreatmentCode>
                <rsm:Quantity unitCode="TNE">236</rsm:Quantity>
                <rsm:MethodBasisCode>C</rsm:MethodBasisCode>
                <rsm:MethodUsed>
                    <rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
                    <rsm:Designation>none specified</rsm:Designation>
                </rsm:MethodUsed>
                <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
            </rsm:WasteTransfer>
        </rsm:FacilityReport>