I am trying to retrieve XML data with Swedish election statistics and build a data frame from it in R, but I'm not very familiar with XML files and can't extract the information I want. I've seen other questions on how to create a data frame from many XML files, but they have a simpler structure than the data I'm working with.
The data is published as a zipped folder with many XML files. It can be downloaded and unzipped with the following R code:
library(xml2)
library(tidyverse)
tf <- tempfile(tmpdir = tdir <- tempdir())
download.file("https://data.val.se/val/val2014/valnatt/valnatt.zip", tf)
xml_files <- unzip(tf, exdir = tdir)
The folder contains one file per election type for each of the 290 municipalities (files with four-digit codes), where the final letter in the filename indicates the type of election (R=national parliament, L=county council, K=municipal council). It also contains 3 XML files with total results, one for each of the three election types. The XML files with municipal data have the following structure (lines deleted for clarity):
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/html"?>
<!DOCTYPE VAL PUBLIC "-//Valmyndigheten//DTD Valresultat parti kommun 1.5//SV" "http://www.val.se/dtd/resultat/parti_kommun_1_5.dtd">
<VAL TILLFÄLLE="Allmänna val 14 september 2014" FILNAMN="valnatt_0114R.xml" RAPPORTERING="VALNATTSRAPPORTERING" VALTYP="Riksdagsval" VALDAG="20140914" VALDAG_FGVAL="20100919" TID_RAPPORT="20140916105203">
<PARTI FÖRKORTNING="M" BETECKNING="Moderaterna" FÄRG="#66BEE6" />
<KOMMUN KOD="0114" NAMN="Upplands Väsby" TYP="Summering" KLARA_VALDISTRIKT="22" ALLA_VALDISTRIKT="22" RÖSTER="23638" RÖSTER_FGVAL="22215" TID_RAPPORT="20140914230336" MODNR="117144935">
<GILTIGA PARTI="M" RÖSTER="6748" RÖSTER_FGVAL="8201" PROCENT="28,5" PROCENT_FGVAL="36,9" PROCENT_ÄNDRING="-8,4"/>
<GILTIGA PARTI="C" RÖSTER="901" RÖSTER_FGVAL="891" PROCENT="3,8" PROCENT_FGVAL="4,0" PROCENT_ÄNDRING="-0,2"/>
<KRETS_KOMMUN KOD="011401" NAMN="Norra valkretsen" TYP="Summering" KLARA_VALDISTRIKT="12" ALLA_VALDISTRIKT="12" RÖSTER="11907" RÖSTER_FGVAL="11202" TID_RAPPORT="20140914222651" MODNR="117118974">
<GILTIGA PARTI="M" RÖSTER="3083" RÖSTER_FGVAL="3860" PROCENT="25,9" PROCENT_FGVAL="34,5" PROCENT_ÄNDRING="-8,6"/>
<GILTIGA PARTI="C" RÖSTER="440" RÖSTER_FGVAL="431" PROCENT="3,7" PROCENT_FGVAL="3,8" PROCENT_ÄNDRING="-0,2"/>
<VALDISTRIKT KOD="01140212" NAMN="Smedby Södra" RÖSTER="1201" RÖSTER_FGVAL="1186" TID_RAPPORT="20140914230336" MODNR="117144935">
<GILTIGA PARTI="M" RÖSTER="227" RÖSTER_FGVAL="336" PROCENT="18,9" PROCENT_FGVAL="28,3" PROCENT_ÄNDRING="-9,4"/>
<GILTIGA PARTI="C" RÖSTER="35" RÖSTER_FGVAL="17" PROCENT="2,9" PROCENT_FGVAL="1,4" PROCENT_ÄNDRING="+1,5"/>
<GILTIGA PARTI="FP" RÖSTER="43" RÖSTER_FGVAL="61" PROCENT="3,6" PROCENT_FGVAL="5,1" PROCENT_ÄNDRING="-1,6"/>
<ÖVRIGA_GILTIGA RÖSTER="20" RÖSTER_FGVAL="10" PROCENT="1,7" PROCENT_FGVAL="0,8" PROCENT_ÄNDRING="+0,8"/>
<OGILTIGA TEXT="BLANK" RÖSTER="12" RÖSTER_FGVAL="13" PROCENT="1,0" PROCENT_FGVAL="1,1" PROCENT_ÄNDRING="-0,1"/>
<OGILTIGA TEXT="OG" RÖSTER="13" RÖSTER_FGVAL="1" PROCENT="1,1" PROCENT_FGVAL="0,1" PROCENT_ÄNDRING="+1,0"/>
<VALDELTAGANDE RÖSTBERÄTTIGADE="1551" RÖSTBERÄTTIGADE_KLARA_VALDISTRIKT_FGVAL="1546" SUMMA_RÖSTER="1226" SUMMA_RÖSTER_FGVAL="1200" PROCENT="79,0" PROCENT_FGVAL="77,6" PROCENT_ÄNDRING="+1,4"/>
</VALDISTRIKT>
</KRETS_KOMMUN>
</KOMMUN>
</VAL>
Now, for each file, I would like to extract the data in all the VALDISTRIKT nodes and their children and create a data frame. I'm not sure how best to structure such a data frame, but the following structure would suffice, where GROUP contains PARTI for GILTIGA, TEXT for OGILTIGA, and just ÖVRIGA_GILTIGA for ÖVRIGA_GILTIGA. If possible, I would also like to add PROCENT and PROCENT_FGVAL from VALDELTAGANDE as variables (with the same value on every row within one VALDISTRIKT).
KOD       NAMN            GROUP           RÖSTER  RÖSTER_FGVAL  PROCENT  PROCENT_FGVAL  PROCENT_ÄNDRING
01140212  "Smedby Södra"  M               227     336           18,9     28,3           -9,4
01140212  "Smedby Södra"  C               35      17            2,9      1,4            +1,5
01140212  "Smedby Södra"  FP              43      61            3,6      5,1            -1,6
01140212  "Smedby Södra"  ÖVRIGA_GILTIGA  20      10            1,7      0,8            +0,8
01140212  "Smedby Södra"  BLANK           12      13            1,0      1,1            -0,1
01140212  "Smedby Södra"  OG              13      1             1,1      0,1            +1,0
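To make the GROUP rule concrete, here is a minimal sketch on a trimmed inline document rather than the real files. The helper name `district_rows` and the column `TURNOUT_PROCENT` are made up for illustration, and only a few attributes are kept. It assumes each VALDISTRIKT has at most one VALDELTAGANDE child.

```r
library(xml2)
library(dplyr)

# Trimmed stand-in for one VALDISTRIKT node. "\u00d6" is "Ö", written as an
# escape so the snippet does not depend on the encoding of the script file.
doc <- read_xml(paste0(
  '<VAL><VALDISTRIKT KOD="01140212" NAMN="Smedby">',
  '<GILTIGA PARTI="M" PROCENT="18,9"/>',
  '<\u00d6VRIGA_GILTIGA PROCENT="1,7"/>',
  '<OGILTIGA TEXT="BLANK" PROCENT="1,0"/>',
  '<VALDELTAGANDE PROCENT="79,0"/>',
  '</VALDISTRIKT></VAL>'
))

# Hypothetical helper: one row per result element of a single district
district_rows <- function(district) {
  kids <- xml_children(district)
  results <- kids[xml_name(kids) != "VALDELTAGANDE"]
  turnout <- xml_find_first(district, "./VALDELTAGANDE")
  tibble(
    KOD = xml_attr(district, "KOD"),
    NAMN = xml_attr(district, "NAMN"),
    # PARTI if present, else TEXT, else the element name itself
    GROUP = coalesce(
      xml_attr(results, "PARTI"),
      xml_attr(results, "TEXT"),
      xml_name(results)
    ),
    PROCENT = xml_attr(results, "PROCENT"),
    # Turnout is a single value, recycled onto every row of the district
    TURNOUT_PROCENT = xml_attr(turnout, "PROCENT")
  )
}

long <- bind_rows(lapply(xml_find_all(doc, "//VALDISTRIKT"), district_rows))
```

The remaining attributes (RÖSTER, RÖSTER_FGVAL and so on) could be added to the `tibble()` call in the same way.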
This information should be retrieved from each VALDISTRIKT in each of the 290 files whose name has four digits and ends with an R. I guess I should loop over those files, or perhaps use map_df?
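Picking out those files could be sketched as below. The paths are hypothetical stand-ins for what unzip() returns, and the name of the totals file in particular is only a guess.

```r
# Hypothetical paths standing in for the result of unzip()
xml_files <- c(
  "/tmp/valnatt_0114R.xml",
  "/tmp/valnatt_0114K.xml",
  "/tmp/valnatt_0114L.xml",
  "/tmp/valnatt_00R.xml"
)

# Keep files named "valnatt_" + exactly four digits + "R" + ".xml"
is_riksdag <- grepl("^valnatt_[0-9]{4}R\\.xml$", basename(xml_files))
riksdag_files <- xml_files[is_riksdag]
```

The resulting vector can then be passed to a loop or to map_df together with a function that reads a single file.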
I understand that this is a lot to ask in a question, and I'm sorry if I'm not using the correct terms for parts of an XML file, but if you could give me some pointers on how to get the information from the xml files into a data frame or where I could read more about how to do this, it would be greatly appreciated.
UPDATE
I've managed to take a few steps forward. For one file, I can get all the information into two separate data frames using the following code, where top includes data about the district and below includes election results. I now just have to find a way to combine the two and adjust the code to read all the files.
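# `t` is assumed to be one parsed file, i.e. the result of read_xml() on one of the paths in xml_files
# Attributes of each VALDISTRIKT node (one row per district)
top <- xml_find_all(t, "//VALDISTRIKT")
top <- top %>%
  map(xml_attrs) %>%
  map_df(~as.list(.))
# Attributes of every child element of each VALDISTRIKT (one row per result)
below <- xml_find_all(t, "//VALDISTRIKT/*")
below <- below %>%
  map(xml_attrs) %>%
  map_df(~as.list(.))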
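One possible way to combine the two data frames, sketched on a small inline document instead of the real files: xml_length() returns the number of child elements of each district, so each row of top can be repeated that many times to line up with below. The helper `attrs_df` and the `DIST_` prefix are made up for illustration. The sketch relies on the child nodes coming back in document order.

```r
library(xml2)
library(dplyr)

doc <- read_xml(paste0(
  '<VAL>',
  '<VALDISTRIKT KOD="01140212" NAMN="A">',
  '<GILTIGA PARTI="M" PROCENT="18,9"/><GILTIGA PARTI="C" PROCENT="2,9"/>',
  '</VALDISTRIKT>',
  '<VALDISTRIKT KOD="01140213" NAMN="B">',
  '<GILTIGA PARTI="M" PROCENT="20,1"/>',
  '</VALDISTRIKT>',
  '</VAL>'
))

# Hypothetical helper: attributes of each node as one row of a data frame
attrs_df <- function(nodes) {
  bind_rows(lapply(nodes, function(n) as.list(xml_attrs(n))))
}

districts <- xml_find_all(doc, "//VALDISTRIKT")
top <- attrs_df(districts)
below <- attrs_df(xml_find_all(doc, "//VALDISTRIKT/*"))

# Repeat each district row once per child element
top_rep <- top[rep(seq_len(nrow(top)), times = xml_length(districts)), ]
# Prefix district columns so they cannot clash with child attributes
names(top_rep) <- paste0("DIST_", names(top_rep))

combined <- bind_cols(top_rep, below)
```

The prefix matters on the real data, where both the district and its children carry attributes such as RÖSTER.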
All the best, R
I got an answer in RStudio Community and I thought I might add it here as well, in case it might be of any help for anyone else.
library(xml2)
library(tidyverse)
# Make a temporary file (tf) and a temporary folder (tdir)
tf <- tempfile(tmpdir = tdir <- tempdir())
# Download the zip file
download.file("https://data.val.se/val/val2014/valnatt/valnatt.zip", tf)
# Unzip it in the temp folder
xml_files <- unzip(tf, exdir = tdir)
# Get the filenames of the files to import
# They have 4 digits in the file name and end with the letter K
# (replace K with R in the pattern to get the national parliament files)
files_to_import <- fs::dir_ls(tdir) %>%
  str_subset(pattern = "valnatt_\\d{4}K\\.xml$")
# Create a function to read a file and get the information wanted
read_dist <- . %>%
  read_xml() %>%
  xml_find_all(., "//VALDISTRIKT") %>%
  map_dfr(~ {
    # extract the attributes from the parent tag as a data.frame
    parent <- xml_attrs(.x) %>% enframe() %>% spread(name, value)
    # make a data.frame out of the attributes of the kids
    kids <- xml_children(.x) %>% map_dfr(~ as.list(xml_attrs(.x)))
    # combine them (bind_cols does not repeat parent rows)
    cbind.data.frame(parent, kids) %>% set_tidy_names() %>% as_tibble()
  })
# Map over all the files
df <- map_df(files_to_import, read_dist)
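Note that every column in df is character, because XML attributes are read as strings, and the percentages use a decimal comma. A minimal base R sketch of converting such values follows; the helper name `to_number` is made up, and the input is a small example vector rather than the real data.

```r
# Example values in the format used by the PROCENT attributes
pct <- c("28,5", "-8,4", "+1,5")

# Hypothetical helper: swap the decimal comma for a point, then convert
to_number <- function(x) as.numeric(sub(",", ".", x, fixed = TRUE))

pct_num <- to_number(pct)
```

The same helper could be applied to the relevant columns of df, for example with mutate().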