I'm working with French electoral data but I'm having issues opening xlsx files to work on them in r. I was wondering if anyone had had the same problem and found a solution.
The issue is that only the first 29 columns out of +100 columns have names and the rest are nameless. I've tried editing the column names in excel before opening them but this solution is time consuming and prone to mistakes. I'm looking for a way to automatize the process.
The datasets have a pattern that I'm trying to exploit to rename the columns and reshape the files:
These 29 columns have names.
The next columns are nameless and correspond to other candidates. They repeat the 8 columns for the other candidates.
There is another layer of difficulty since each precinct does not have the same number of candidates so the number of nameless columns changes.
Ideally, I would want r to recognize the pattern and reshape the datasets to long by creating a new row for each candidate keeping the precinct id and aggregate data in each row. To do this, I would like r to recognize each sequence of nameless 8 columns.
To simplify, let's say that my data frame looks like the following:
precinct_id | tot_votes | candidate_id | candidate_votes | ...1 | ...2 |
---|---|---|---|---|---|
Paris 05 | 1000 | Jean Dupont | 400 | Paul Dupuy | 300 |
Paris 06 | 500 | Jean Dupont | 50 | Paul Dupuy | 150 |
where:
I need r to select the observations in each sequence of 2 columns and paste them into new rows under candidate_id candidate_votes while keeping the precinct_id and precinct_votes columns.
precinct_id | tot_votes | candidate_id | candidate_votes |
---|---|---|---|
Paris 05 | 1000 | Jean Dupont | 400 |
Paris 06 | 500 | Jean Dupont | 50 |
Paris 05 | 1000 | Paul Dupuy | 300 |
Paris 06 | 500 | Paul Dupuy | 150 |
I have no idea how to reshape without column names... Any help would be greatly appreciated! Thanks!
PS: The files come from here: https://www.data.gouv.fr/fr/datasets/elections-legislatives-des-12-et-19-juin-2022-resultats-definitifs-du-premier-tour/
Actually, there's an even simpler solution to the one I suggested. .names_repair
can take a function as its value. This function should accept a vector of "input" column names and return a vector of "output column names". As we want to treat the data for the first candidate in each row in eactly the same way as every subsequent set of eight columns, I'll ignore only the first 21 columns, not the first 29.
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
)
# A tibble: 35,429 × 197
`Code du département` `Libellé du dép…` `Code de la ci…` `Libellé de la…` `Code de la co…` `Libellé de la…` `Etat saisie` Inscrits Abstentions
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 01 Ain 01 1ère circonscri… 016 Arbigny Complet 327 154
2 01 Ain 01 1ère circonscri… 024 Attignat Complet 2454 1281
3 01 Ain 01 1ère circonscri… 029 Beaupont Complet 446 224
4 01 Ain 01 1ère circonscri… 038 Bény Complet 604 306
5 01 Ain 01 1ère circonscri… 040 Béréziat Complet 362 179
6 01 Ain 01 1ère circonscri… 050 Boissey Complet 262 137
7 01 Ain 01 1ère circonscri… 053 Bourg-en-Bresse Complet 15516 8426
8 01 Ain 01 1ère circonscri… 057 Boz Complet 391 210
9 01 Ain 01 1ère circonscri… 065 Buellas Complet 1408 654
10 01 Ain 01 1ère circonscri… 069 Certines Complet 1169 639
# … with 35,419 more rows, and 188 more variables: `% Abs/Ins` <dbl>, Votants <dbl>, `% Vot/Ins` <dbl>, Blancs <dbl>, `% Blancs/Ins` <dbl>,
# `% Blancs/Vot` <dbl>, Nuls <dbl>, `% Nuls/Ins` <dbl>, `% Nuls/Vot` <dbl>, Exprimés <dbl>, `% Exp/Ins` <dbl>, `% Exp/Vot` <dbl>,
# C1_NPanneau <dbl>, C1_Sexe <chr>, C1_Nom <chr>, C1_Prénom <chr>, C1_Nuance <chr>, C1_Voix <dbl>, C1_PctVoixIns <dbl>, C1_PctVoixExp <dbl>,
# C2_NPanneau <dbl>, C2_Sexe <chr>, C2_Nom <chr>, C2_Prénom <chr>, C2_Nuance <chr>, C2_Voix <dbl>, C2_PctVoixIns <dbl>, C2_PctVoixExp <dbl>,
# C3_NPanneau <dbl>, C3_Sexe <chr>, C3_Nom <chr>, C3_Prénom <chr>, C3_Nuance <chr>, C3_Voix <dbl>, C3_PctVoixIns <dbl>, C3_PctVoixExp <dbl>,
# C4_NPanneau <dbl>, C4_Sexe <chr>, C4_Nom <chr>, C4_Prénom <chr>, C4_Nuance <chr>, C4_Voix <dbl>, C4_PctVoixIns <dbl>, C4_PctVoixExp <dbl>,
# C5_NPanneau <dbl>, C5_Sexe <chr>, C5_Nom <chr>, C5_Prénom <chr>, C5_Nuance <chr>, C5_Voix <dbl>, C5_PctVoixIns <dbl>, C5_PctVoixExp <dbl>, …
That's read the data in and named the columns. To get the final format you want, we will need to do a standard pivot_longer()/pivot_wider()
trick, but the situation here is slightly complicated because some of your columns are character and some are numeric. So first, I'll turn the numeric columns into character columns so that the pivot_longer()
step doesn't fail.
For clarity, I'll drop the first 21 columns so that it's easy to see what's going on.
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
) %>%
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>%
select(!1:21)
# A tibble: 6,235,504 × 3
Candidate Variable Value
<chr> <chr> <chr>
1 C1 NPanneau 2
2 C1 Sexe M
3 C1 Nom LAHY
4 C1 Prénom Éric
5 C1 Nuance DXG
6 C1 Voix 2
7 C1 PctVoixIns 0.61
8 C1 PctVoixExp 1.23
9 C2 NPanneau 8
10 C2 Sexe M
# … with 6,235,494 more rows
Now add the pivot_wider()
, again dropping the first 21 columns, purely for clarity.
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
) %>%
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>%
pivot_wider(names_from=Variable, values_from=Value) %>%
select(!1:21)
# A tibble: 779,438 × 9
Candidate NPanneau Sexe Nom Prénom Nuance Voix PctVoixIns PctVoixExp
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 C1 2 M LAHY Éric DXG 2 0.61 1.23
2 C2 8 M GUÉRAUD Sébastien NUP 26 7.95 15.95
3 C3 7 F ARMENJON Eliane ECO 3 0.92 1.84
4 C4 1 M GUILLERMIN Vincent ENS 30 9.17 18.4
5 C5 3 M BRETON Xavier LR 44 13.46 26.99
6 C6 5 M MENDES Michael DSV 3 0.92 1.84
7 C7 6 M BELLON Julien REC 6 1.83 3.68
8 C8 4 F PIROUX GIANNOTTI Brigitte RN 49 14.98 30.06
9 C9 NA NA NA NA NA NA NA NA
10 C10 NA NA NA NA NA NA NA NA
# … with 779,428 more rows
Finally, convert the "temporary character" columns back to numeric. (Still dropping the first 21 columns for clarity.)
read_excel(
"resultats-par-niveau-subcom-t1-france-entiere.xlsx",
.name_repair=function(x) {
suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
for (i in 1:length(x)) {
if (i > 21) {
x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
}
}
x
}
) %>%
mutate(across(where(is.numeric) | where(is.logical), as.character)) %>%
pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>%
pivot_wider(names_from=Variable, values_from=Value) %>%
mutate(across(c(Voix, PctVoixIns, PctVoixExp), as.numeric)) %>%
select(!1:21)
# A tibble: 779,438 × 9
Candidate NPanneau Sexe Nom Prénom Nuance Voix PctVoixIns PctVoixExp
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 C1 2 M LAHY Éric DXG 2 0.61 1.23
2 C2 8 M GUÉRAUD Sébastien NUP 26 7.95 16.0
3 C3 7 F ARMENJON Eliane ECO 3 0.92 1.84
4 C4 1 M GUILLERMIN Vincent ENS 30 9.17 18.4
5 C5 3 M BRETON Xavier LR 44 13.5 27.0
6 C6 5 M MENDES Michael DSV 3 0.92 1.84
7 C7 6 M BELLON Julien REC 6 1.83 3.68
8 C8 4 F PIROUX GIANNOTTI Brigitte RN 49 15.0 30.1
9 C9 NA NA NA NA NA NA NA NA
10 C10 NA NA NA NA NA NA NA NA
# … with 779,428 more rows
This, I think, is the format you want, though you may need to arrange()
the rows into the order you want. Obviously, you should drop the final %>% select(!1:21)
for your production version.
It is an easy matter to convert this code to a function that accepts a filename as its parameter and then use this in an lapply
to read an entire folder into a list of data frames. However...