Search code examples
rpivot-tablereshapecolumnsortingreadxl

Opening and reshaping xlsx files with nameless columns in r using a pattern


I'm working with French electoral data but I'm having issues opening xlsx files to work on them in r. I was wondering if anyone had had the same problem and found a solution.

The issue is that only the first 29 columns out of +100 columns have names and the rest are nameless. I've tried editing the column names in excel before opening them but this solution is time consuming and prone to mistakes. I'm looking for a way to automatize the process.

The datasets have a pattern that I'm trying to exploit to rename the columns and reshape the files:

  • the first 6 columns correspond to the geographic id of the precinct (region, municipality, etc...)
  • the next 15 columns give information about aggregate results in the precinct (number of voters, number of registered voters, participation, etc..)
  • The next 8 columns give information about a given candidate and her results in the precinct (name, sex, party id, number of votes, .. etc)

These 29 columns have names.

The next columns are nameless and correspond to other candidates. They repeat the 8 columns for the other candidates.

There is another layer of difficulty since each precinct does not have the same number of candidates so the number of nameless columns changes.

Ideally, I would want r to recognize the pattern and reshape the datasets to long by creating a new row for each candidate keeping the precinct id and aggregate data in each row. To do this, I would like r to recognize each sequence of nameless 8 columns.

To simplify, let's say that my data frame looks like the following:

precinct_id tot_votes candidate_id candidate_votes ...1 ...2
Paris 05 1000 Jean Dupont 400 Paul Dupuy 300
Paris 06 500 Jean Dupont 50 Paul Dupuy 150

where:

  • candidate_id and candidate_votes correspond to the id and result of the first candidate
  • ...1, ...2 is how r is automatically renaming the nameless columns that correspond to candidate_id and candidate_votes for candidate 2 in the same precinct.

I need r to select the observations in each sequence of 2 columns and paste them into new rows under candidate_id candidate_votes while keeping the precinct_id and precinct_votes columns.

precinct_id tot_votes candidate_id candidate_votes
Paris 05 1000 Jean Dupont 400
Paris 06 500 Jean Dupont 50
Paris 05 1000 Paul Dupuy 300
Paris 06 500 Paul Dupuy 150

I have no idea how to reshape without column names... Any help would be greatly appreciated! Thanks!

PS: The files come from here: https://www.data.gouv.fr/fr/datasets/elections-legislatives-des-12-et-19-juin-2022-resultats-definitifs-du-premier-tour/


Solution

  • Actually, there's an even simpler solution to the one I suggested. .names_repair can take a function as its value. This function should accept a vector of "input" column names and return a vector of "output column names". As we want to treat the data for the first candidate in each row in eactly the same way as every subsequent set of eight columns, I'll ignore only the first 21 columns, not the first 29.

    read_excel(
      "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
      .name_repair=function(x) {
        suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
        if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
        for (i in 1:length(x)) {
          if (i > 21) {
            x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
          }
        }
        x
      }
    )
    # A tibble: 35,429 × 197                                                                                                                       
       `Code du département` `Libellé du dép…` `Code de la ci…` `Libellé de la…` `Code de la co…` `Libellé de la…` `Etat saisie` Inscrits Abstentions
       <chr>                 <chr>             <chr>            <chr>            <chr>            <chr>            <chr>            <dbl>       <dbl>
     1 01                    Ain               01               1ère circonscri… 016              Arbigny          Complet            327         154
     2 01                    Ain               01               1ère circonscri… 024              Attignat         Complet           2454        1281
     3 01                    Ain               01               1ère circonscri… 029              Beaupont         Complet            446         224
     4 01                    Ain               01               1ère circonscri… 038              Bény             Complet            604         306
     5 01                    Ain               01               1ère circonscri… 040              Béréziat         Complet            362         179
     6 01                    Ain               01               1ère circonscri… 050              Boissey          Complet            262         137
     7 01                    Ain               01               1ère circonscri… 053              Bourg-en-Bresse  Complet          15516        8426
     8 01                    Ain               01               1ère circonscri… 057              Boz              Complet            391         210
     9 01                    Ain               01               1ère circonscri… 065              Buellas          Complet           1408         654
    10 01                    Ain               01               1ère circonscri… 069              Certines         Complet           1169         639
    # … with 35,419 more rows, and 188 more variables: `% Abs/Ins` <dbl>, Votants <dbl>, `% Vot/Ins` <dbl>, Blancs <dbl>, `% Blancs/Ins` <dbl>,
    #   `% Blancs/Vot` <dbl>, Nuls <dbl>, `% Nuls/Ins` <dbl>, `% Nuls/Vot` <dbl>, Exprimés <dbl>, `% Exp/Ins` <dbl>, `% Exp/Vot` <dbl>,
    #   C1_NPanneau <dbl>, C1_Sexe <chr>, C1_Nom <chr>, C1_Prénom <chr>, C1_Nuance <chr>, C1_Voix <dbl>, C1_PctVoixIns <dbl>, C1_PctVoixExp <dbl>,
    #   C2_NPanneau <dbl>, C2_Sexe <chr>, C2_Nom <chr>, C2_Prénom <chr>, C2_Nuance <chr>, C2_Voix <dbl>, C2_PctVoixIns <dbl>, C2_PctVoixExp <dbl>,
    #   C3_NPanneau <dbl>, C3_Sexe <chr>, C3_Nom <chr>, C3_Prénom <chr>, C3_Nuance <chr>, C3_Voix <dbl>, C3_PctVoixIns <dbl>, C3_PctVoixExp <dbl>,
    #   C4_NPanneau <dbl>, C4_Sexe <chr>, C4_Nom <chr>, C4_Prénom <chr>, C4_Nuance <chr>, C4_Voix <dbl>, C4_PctVoixIns <dbl>, C4_PctVoixExp <dbl>,
    #   C5_NPanneau <dbl>, C5_Sexe <chr>, C5_Nom <chr>, C5_Prénom <chr>, C5_Nuance <chr>, C5_Voix <dbl>, C5_PctVoixIns <dbl>, C5_PctVoixExp <dbl>, …  
    

    That's read the data in and named the columns. To get the final format you want, we will need to do a standard pivot_longer()/pivot_wider() trick, but the situation here is slightly complicated because some of your columns are character and some are numeric. So first, I'll turn the numeric columns into character columns so that the pivot_longer() step doesn't fail.

    For clarity, I'll drop the first 21 columns so that it's easy to see what's going on.

    read_excel(
      "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
      .name_repair=function(x) {
        suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
        if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
        for (i in 1:length(x)) {
          if (i > 21) {
            x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
          }
        }
        x
      }
    )  %>% 
    mutate(across(where(is.numeric) | where(is.logical), as.character)) %>% 
    pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>% 
    select(!1:21)
    # A tibble: 6,235,504 × 3                                                                                                                      
       Candidate Variable   Value
       <chr>     <chr>      <chr>
     1 C1        NPanneau   2    
     2 C1        Sexe       M    
     3 C1        Nom        LAHY 
     4 C1        Prénom     Éric 
     5 C1        Nuance     DXG  
     6 C1        Voix       2    
     7 C1        PctVoixIns 0.61 
     8 C1        PctVoixExp 1.23 
     9 C2        NPanneau   8    
    10 C2        Sexe       M    
    # … with 6,235,494 more rows
    

    Now add the pivot_wider(), again dropping the first 21 columns, purely for clarity.

    read_excel(
      "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
      .name_repair=function(x) {
        suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
        if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
        for (i in 1:length(x)) {
          if (i > 21) {
            x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
          }
        }
        x
      }
    )  %>% 
    mutate(across(where(is.numeric) | where(is.logical), as.character)) %>% 
    pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>% 
    pivot_wider(names_from=Variable, values_from=Value) %>% 
    select(!1:21)
    # A tibble: 779,438 × 9                                                                                                                        
       Candidate NPanneau Sexe  Nom              Prénom    Nuance Voix  PctVoixIns PctVoixExp
       <chr>     <chr>    <chr> <chr>            <chr>     <chr>  <chr> <chr>      <chr>     
     1 C1        2        M     LAHY             Éric      DXG    2     0.61       1.23      
     2 C2        8        M     GUÉRAUD          Sébastien NUP    26    7.95       15.95     
     3 C3        7        F     ARMENJON         Eliane    ECO    3     0.92       1.84      
     4 C4        1        M     GUILLERMIN       Vincent   ENS    30    9.17       18.4      
     5 C5        3        M     BRETON           Xavier    LR     44    13.46      26.99     
     6 C6        5        M     MENDES           Michael   DSV    3     0.92       1.84      
     7 C7        6        M     BELLON           Julien    REC    6     1.83       3.68      
     8 C8        4        F     PIROUX GIANNOTTI Brigitte  RN     49    14.98      30.06     
     9 C9        NA       NA    NA               NA        NA     NA    NA         NA        
    10 C10       NA       NA    NA               NA        NA     NA    NA         NA        
    # … with 779,428 more rows
    

    Finally, convert the "temporary character" columns back to numeric. (Still dropping the first 21 columns for clarity.)

    read_excel(
      "resultats-par-niveau-subcom-t1-france-entiere.xlsx",
      .name_repair=function(x) {
        suffixes <- c("NPanneau", "Sexe", "Nom", "Prénom", "Nuance", "Voix", "PctVoixIns", "PctVoixExp")
        if ((length(x) - 21) %% 8 != 0) stop(paste("Don't know how to handle a sheet with", length(x), "columns [", (length(x) - 21) %% 8, "]"))
        for (i in 1:length(x)) {
          if (i > 21) {
            x[i] <- paste0("C", 1 + floor((i-22)/8), "_", suffixes[1 + (i-22) %% 8])
          }
        }
        x
      }
    )  %>% 
    mutate(across(where(is.numeric) | where(is.logical), as.character)) %>% 
    pivot_longer(!1:21, names_sep="_", names_to=c("Candidate", "Variable"), values_to="Value") %>% 
    pivot_wider(names_from=Variable, values_from=Value) %>% 
    mutate(across(c(Voix, PctVoixIns, PctVoixExp), as.numeric)) %>% 
    select(!1:21)
    # A tibble: 779,438 × 9                                                                                                                        
       Candidate NPanneau Sexe  Nom              Prénom    Nuance  Voix PctVoixIns PctVoixExp
       <chr>     <chr>    <chr> <chr>            <chr>     <chr>  <dbl>      <dbl>      <dbl>
     1 C1        2        M     LAHY             Éric      DXG        2       0.61       1.23
     2 C2        8        M     GUÉRAUD          Sébastien NUP       26       7.95      16.0 
     3 C3        7        F     ARMENJON         Eliane    ECO        3       0.92       1.84
     4 C4        1        M     GUILLERMIN       Vincent   ENS       30       9.17      18.4 
     5 C5        3        M     BRETON           Xavier    LR        44      13.5       27.0 
     6 C6        5        M     MENDES           Michael   DSV        3       0.92       1.84
     7 C7        6        M     BELLON           Julien    REC        6       1.83       3.68
     8 C8        4        F     PIROUX GIANNOTTI Brigitte  RN        49      15.0       30.1 
     9 C9        NA       NA    NA               NA        NA        NA      NA         NA   
    10 C10       NA       NA    NA               NA        NA        NA      NA         NA   
    # … with 779,428 more rows
    

    This, I think, is the format you want, though you may need to arrange() the rows into the order you want. Obviously, you should drop the final %>% select(!1:21) for your production version.

    It is an easy matter to convert this code to a function that accepts a filename as its parameter and then use this in an lapply to read an entire folder into a list of data frames. However...

    • It appears that not every file in the folder has the same layout. resultats-par-niveau-fe-t1-outre-mer.xlsx, for example, appears to have fewer "prefix columns" before the 8-columns-per-candidate repeat begins.
    • The import generates several warnings. This appears to be because the election(?) with the largest number of candidates does not appear in the first rows of the worksheet. I've not investigated whether these warnings are generated by meaningful problems with the import.