I am trying to extract data from https://prtr.defra.gov.uk/full-dataset and the file is called uk_prtr_dataset_2021.xml
library(xml2)
library(dplyr)
xml_file <- read_xml('uk_prtr_dataset_2021.xml')
I want to extract following nodes
node_vec <- c("ParentCompanyName", "FacilityName", "LongitudeMeasure", "LatitudeMeasure", "MainEconomicActivityName")
for(node in seq_along(node_vec)){
node_ref <- node_vec[node]
var_ref <- paste0(".//rsm:",node_ref)
temp <- xml_file %>%
xml_find_all(var_ref) %>%
as_list() %>%
simplify() %>%
enframe() %>%
unnest(value) %>%
unnest(value)
print(paste0(node_ref,": ",nrow(temp)))
}
[1] "ParentCompanyName: 6305"
[1] "FacilityName: 6306"
[1] "LongitudeMeasure: 6305"
[1] "LatitudeMeasure: 6305"
[1] "MainEconomicActivityName: 6305"
However, I see that FacilityName
has one extra entry due to which I cannot bind the 5 columns together. Is there anyway I can figure out which is the extra entry so that I can remove it.
EDIT: A small part of XML
<rsm:PollutantReleaseAndTransferReport xmlns:rsm="urn:eu:com:env:prtr:data:standard:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:eu:com:env:prtr:data:standard:2 http://www.eionet.europa.eu/schemas/eprtr/PollutantReleaseAndTransferReport_2p0.xsd">rsm:ReportingYear2021</rsm:ReportingYear>rsm:CountryIDUK</rsm:CountryID>rsm:CoordinateSystemIDEPSG:4326</rsm:CoordinateSystemID>rsm:RemarkTextNone</rsm:RemarkText>rsm:CompetentAuthorityPartyrsm:NameBEIS</rsm:Name>rsm:Addressrsm:StreetName1 Victoria Street</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1H 0ET</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberText00 44 774 169 9372</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberText--</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:EmailURIIDShamim.Choudhury@beis.gov.uk</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameShamim Choudhury</rsm:ContactPersonName></rsm:CompetentAuthorityParty>rsm:CompetentAuthorityPartyrsm:NameE1</rsm:Name>rsm:Addressrsm:StreetNameAll correspondence to: Defra, Industrial Pollution, 5F Ergon House, Horseferry Rd. London.</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1P 2AL</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:EmailURIIDprtr@defra.gsi.gov.uk</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameNational Data Manager</rsm:ContactPersonName></rsm:CompetentAuthorityParty>rsm:CompetentAuthorityPartyrsm:NameE10</rsm:Name>rsm:Addressrsm:StreetNameAll correspondence to: Defra, Industrial Pollution, 5F Ergon House, Horseferry Rd. London.</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1P 2AL</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:EmailURIIDprtr@defra.gsi.gov.uk</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameNational Data Manager</rsm:ContactPersonName>
I'd approach this bit differently; this issue resolves on its own when we move from building column vectors to trimmed-down FacilityReport records (i.e. rows for future data.frame / tibble), if something is missing, we'll just have NA
values in those specific rows instead of varying length column vectors.
We can start by finding all rsm:FacilityReport
nodes; to speed things up a bit, we'll remove all subnodes that are not needed. From there, we can turn reports
node-set to a list and create a nested tibble, hoist nested values.
library(xml2)
library(dplyr)
library(tidyr)
tictoc::tic("load and process")
xml_file <- read_xml('uk_prtr_dataset_2021.xml')
# collect all reports
reports <- xml_find_all(xml_file, "//rsm:FacilityReport")
# collect and remove all other nodes so we could get a clean result from as_list(reports)
to_remove <- xml_find_all(reports, "./*[not(self::rsm:ParentCompanyName or self::rsm:FacilityName or self::rsm:GeographicalCoordinate or self::rsm:MainEconomicActivityName)]")
xml_remove(to_remove)
reports_tbl <-
reports |>
as_list() |>
tibble(report = _) |>
# extract nested values
hoist(report,
parent = list("ParentCompanyName", 1),
facility = list("FacilityName", 1),
main_act = list("MainEconomicActivityName", 1),
lon = list("GeographicalCoordinate","LongitudeMeasure", 1),
lat = list("GeographicalCoordinate","LatitudeMeasure", 1),
) |>
select(-report)
tictoc::toc()
#> load and process: 7.28 sec elapsed
Resulting frame along with the record that caused issues:
reports_tbl
#> # A tibble: 6,306 × 5
#> parent facility main_act lon lat
#> <chr> <chr> <chr> <chr> <chr>
#> 1 IQE (Europe) Ltd IQE (Europe) Limited, St Mellons… Manufac… -3.0… 51.5…
#> 2 G B Davies & Co G B Davies & Co, Plas Bach Farm Raising… -3.1… 52.7…
#> 3 Hook 2 Sisters Limited Hook 2 Sisters Limited, Merllyn … Raising… -3.2… 53.2…
#> 4 Hook 2 Sisters Limited Hook 2 Sisters Limited, Morlais Raising… -4.3… 53.1…
#> 5 Mills Poultry Ltd. Mills Poultry Ltd., Pentreucha F… Raising… -3.4… 52.6…
#> 6 Mills Poultry Ltd. Mills Poultry Ltd., Minffordd Fa… Raising… -3.4… 52.6…
#> 7 Mr Philip Weale P J Weale, Highfield Poultry Farm Raising… -4.0… 51.5…
#> 8 Uniper UK Limited Connah'S Quay Power Station Product… -3.0… 53.2…
#> 9 Glas Cymru Cyfyngedig Nash Stw - Final Effluent Sewerage -2.9… 51.5…
#> 10 Glas Cymru Cyfyngedig Five Fords Stw Sewerage -2.9… 53.0…
#> # ℹ 6,296 more rows
# offending FacilityReport:
reports_tbl[!complete.cases(reports_tbl),]
#> # A tibble: 1 × 5
#> parent facility main_act lon lat
#> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA> Newly created but not approved <NA> <NA> <NA>
Created on 2023-11-22 with reprex v2.0.2
A rsm:FacilityReport
node for reference:
<rsm:FacilityReport>
<rsm:NationalID>Wales_KP3235SS</rsm:NationalID>
<rsm:PreviousNationalID>
<rsm:NationalID>Wales_KP3235SS</rsm:NationalID>
<rsm:ReportingYear>2020</rsm:ReportingYear>
</rsm:PreviousNationalID>
<rsm:ParentCompanyName>IQE (Europe) Ltd</rsm:ParentCompanyName>
<rsm:FacilityName>IQE (Europe) Limited, St Mellons Semiconductor Plant</rsm:FacilityName>
<rsm:Address>
<rsm:StreetName>Cypress Drive, Cardiff</rsm:StreetName>
<rsm:BuildingNumber>--</rsm:BuildingNumber>
<rsm:CityName>Cardiff</rsm:CityName>
<rsm:PostcodeCode>CF3 0LW</rsm:PostcodeCode>
</rsm:Address>
<rsm:GeographicalCoordinate>
<rsm:LongitudeMeasure>-3.085770</rsm:LongitudeMeasure>
<rsm:LatitudeMeasure>51.530800</rsm:LatitudeMeasure>
</rsm:GeographicalCoordinate>
<rsm:RiverBasinDistrictID>UK09</rsm:RiverBasinDistrictID>
<rsm:NACEMainEconomicActivityCode>27.90</rsm:NACEMainEconomicActivityCode>
<rsm:MainEconomicActivityName>Manufacture of other electrical equipment</rsm:MainEconomicActivityName>
<rsm:CompetentAuthorityPartyName>NRW02</rsm:CompetentAuthorityPartyName>
<rsm:NutsRegionID>UKL2</rsm:NutsRegionID>
<rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
<rsm:ProtectVoluntaryData>0</rsm:ProtectVoluntaryData>
<rsm:Activity>
<rsm:RankingNumeric>1</rsm:RankingNumeric>
<rsm:AnnexIActivityCode>4.(b).(v)</rsm:AnnexIActivityCode>
</rsm:Activity>
<rsm:PollutantRelease>
<rsm:MediumCode>AIR</rsm:MediumCode>
<rsm:PollutantCode>NH3</rsm:PollutantCode>
<rsm:MethodBasisCode>C</rsm:MethodBasisCode>
<rsm:MethodUsed>
<rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
<rsm:Designation>none specified</rsm:Designation>
</rsm:MethodUsed>
<rsm:TotalQuantity unitCode="KGM">1.98</rsm:TotalQuantity>
<rsm:AccidentalQuantity unitCode="KGM">0.00</rsm:AccidentalQuantity>
<rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
<rsm:RemarkText>significant drop 2020 v 2021 is due to vastly decreased ammonia emissions to air.</rsm:RemarkText>
</rsm:PollutantRelease>
<rsm:PollutantRelease>
<rsm:MediumCode>AIR</rsm:MediumCode>
<rsm:PollutantCode>AS AND COMPOUNDS</rsm:PollutantCode>
<rsm:MethodBasisCode>C</rsm:MethodBasisCode>
<rsm:MethodUsed>
<rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
<rsm:Designation>none specified</rsm:Designation>
</rsm:MethodUsed>
<rsm:TotalQuantity unitCode="KGM">6.56</rsm:TotalQuantity>
<rsm:AccidentalQuantity unitCode="KGM">0.00</rsm:AccidentalQuantity>
<rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
</rsm:PollutantRelease>
<rsm:WasteTransfer>
<rsm:WasteTypeCode>NON-HW</rsm:WasteTypeCode>
<rsm:WasteTreatmentCode>D</rsm:WasteTreatmentCode>
<rsm:Quantity unitCode="TNE">2.60</rsm:Quantity>
<rsm:MethodBasisCode>C</rsm:MethodBasisCode>
<rsm:MethodUsed>
<rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
<rsm:Designation>none specified</rsm:Designation>
</rsm:MethodUsed>
<rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
<rsm:RemarkText>2021 non-hazardous waste to landfill.</rsm:RemarkText>
</rsm:WasteTransfer>
<rsm:WasteTransfer>
<rsm:WasteTypeCode>NON-HW</rsm:WasteTypeCode>
<rsm:WasteTreatmentCode>R</rsm:WasteTreatmentCode>
<rsm:Quantity unitCode="TNE">7.56</rsm:Quantity>
<rsm:MethodBasisCode>C</rsm:MethodBasisCode>
<rsm:MethodUsed>
<rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
<rsm:Designation>none specified</rsm:Designation>
</rsm:MethodUsed>
<rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
<rsm:RemarkText>waste to energy - 5107kg
recycling - 2576kg</rsm:RemarkText>
</rsm:WasteTransfer>
<rsm:WasteTransfer>
<rsm:WasteTypeCode>HWIC</rsm:WasteTypeCode>
<rsm:WasteTreatmentCode>D</rsm:WasteTreatmentCode>
<rsm:Quantity unitCode="TNE">236</rsm:Quantity>
<rsm:MethodBasisCode>C</rsm:MethodBasisCode>
<rsm:MethodUsed>
<rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
<rsm:Designation>none specified</rsm:Designation>
</rsm:MethodUsed>
<rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
</rsm:WasteTransfer>
</rsm:FacilityReport>