Search code examples
rdataframerecodereorganize

How to convert a variable number of columns per row to multiple rows in R?


Here is my problem: I have an individual database: 1 row = 1 person. For each person, there is a unique identifier ("INAMI_key"), individual variables ("code_qualif" in the example below) and one or several addresses filled in through different columns. The number of addresses is indicated in the "n_addresses" variable: 1 for one address, 2 for two addresses, etc. The different addresses are indicated in the variables "travail_ruex" (street) and "travail_code_postalx" (postal code). This is what it looks like:

   INAMI_key code_qualif n_adresses travail_rue1                       travail_code_postal1 travail_rue2                       travail_code_postal2 travail_rue3                       travail_code_postal3 travail_rue4                       travail_code_postal4 travail_rue5                       travail_code_postal5
 1 30000120  001         02         "RUE VAN ARTEVELDE               " " 1000"              "paul pastur                     " " 6180"              ""                                 ""                   ""                                 ""                   ""                                 ""                  
 2 30000417  001         01         "av Margueritr depasse           " " 6060"              ""                                 ""                   ""                                 ""                   ""                                 ""                   ""                                 ""                  
 3 37603435  007         01         "du Grand Veneur                 " " 1170"              ""                                 ""                   ""                                 ""                   ""                                 ""                   ""                                 ""                  
 4 38300152  007         02         "RUE WAUTERS                92-94" " 6040"              "de châtelet                     " " 6120"              ""                                 ""                   ""                                 ""                   ""                                 ""                  
 5 38707849  001         03         "de Campine                      " " 4000"              "de Chestret                     " " 4000"              "de la Gare                      " " 4020"              ""                                 ""                   ""                                 ""                  
 6 38813856  001         03         "Torhoutste steenweg             " " 8400"              "Lage Kaart                      " " 2930"              "De Vrièrestraat                 " " 8301"              ""                                 ""                   ""                                 ""                  
 7 38811084  001         04         "chaussée de Waterloo            " " 1180"              "avenue Napoléon                 " " 1420"              "rue Léon Théodor                " " 1090"              "avenue de la Basilique          " " 1081"              ""                                 ""                  
 8 39105054  001         04         "EMILE  CLAUS                    " " 1050"              "RUE DU FOYER SCHAERBEEKOIS      " " 1030"              "RUE XAVIER DE BUE               " " 1180"              "BV LAMBERMONT                   " " 1030"              ""                                 ""                  
 9 39117031  001         05         "KERKSTRAAT                      " " 3850"              "Pater Richard van de Wouwerstraa" " 3271"              "Wilderenlaan                    " " 3803"              "Gyzevennestraat                 " " 3560"              "Molenveldstraat                 " " 3500"             
10 31823918  070         05         "Route de l'Etat                 " " 1380"              "Avenue Paul Hymans              " " 1200"              "Avenue WInston Churchill        " " 1180"              "Avenue Winston Churchill        " " 1180"              "avenue hippocrate               " " 1200"             

Here you have the code to import this example in R :

structure(list(INAMI_key = c("30000120", "30000417", "37603435", 
"38300152", "38707849", "38813856", "38811084", "39105054", "39117031", 
"31823918"), code_qualif = c("001", "001", "007", "007", "001", 
"001", "001", "001", "001", "070"), n_adresses = c("02", "01", 
"01", "02", "03", "03", "04", "04", "05", "05"), travail_rue1 = c("RUE VAN ARTEVELDE               ", 
"av Margueritr depasse           ", "du Grand Veneur                 ", 
"RUE WAUTERS                92-94", "de Campine                      ", 
"Torhoutste steenweg             ", "chaussée de Waterloo            ", 
"EMILE  CLAUS                    ", "KERKSTRAAT                      ", 
"Route de l'Etat                 "), travail_code_postal1 = c(" 1000", 
" 6060", " 1170", " 6040", " 4000", " 8400", " 1180", " 1050", 
" 3850", " 1380"), travail_rue2 = c("paul pastur                     ", 
"", "", "de châtelet                     ", "de Chestret                     ", 
"Lage Kaart                      ", "avenue Napoléon                 ", 
"RUE DU FOYER SCHAERBEEKOIS      ", "Pater Richard van de Wouwerstraa", 
"Avenue Paul Hymans              "), travail_code_postal2 = c(" 6180", 
"", "", " 6120", " 4000", " 2930", " 1420", " 1030", " 3271", 
" 1200"), travail_rue3 = c("", "", "", "", "de la Gare                      ", 
"De Vrièrestraat                 ", "rue Léon Théodor                ", 
"RUE XAVIER DE BUE               ", "Wilderenlaan                    ", 
"Avenue WInston Churchill        "), travail_code_postal3 = c("", 
"", "", "", " 4020", " 8301", " 1090", " 1180", " 3803", " 1180"
), travail_rue4 = c("", "", "", "", "", "", "avenue de la Basilique          ", 
"BV LAMBERMONT                   ", "Gyzevennestraat                 ", 
"Avenue Winston Churchill        "), travail_code_postal4 = c("", 
"", "", "", "", "", " 1081", " 1030", " 3560", " 1180"), travail_rue5 = c("", 
"", "", "", "", "", "", "", "Molenveldstraat                 ", 
"avenue hippocrate               "), travail_code_postal5 = c("", 
"", "", "", "", "", "", "", " 3500", " 1200")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

What I would like to do is multiply the number of rows per individual to show the different addresses on different rows, but in the same fields. For example, if an individual has 3 addresses, create 3 rows for the same individual, keep the individual variables, but reorganize the addresses into columns that have the same name: "travail_rue_total" and "travail_code_postal_total" in the example below. If the individual has 1 adress, create one row, if he has 5 adresses, create 5 rows, etc. :

   INAMI_key code_qualif n_adresses travail_rue_total                travail_code_postal_total
 1  30000120           1          2 RUE VAN ARTEVELDE                                     1000
 2  30000120           1          2 paul pastur                                           6180
 3  30000417           1          1 av Margueritr depasse                                 6060
 4  37603435           7          1 du Grand Veneur                                       1170
 5  38300152           7          2 RUE WAUTERS                92-94                      6040
 6  38300152           7          2 de châtelet                                           6120
 7  38707849           1          3 de Campine                                            4000
 8  38707849           1          3 de Chestret                                           4000
 9  38707849           1          3 de la Gare                                            4020
10  38813856           1          3 Torhoutste steenweg                                   8400
11  38813856           1          3 Lage Kaart                                            2930
12  38813856           1          3 De Vrièrestraat                                       8301
13  38811084           1          4 chaussée de Waterloo                                  1180
14  38811084           1          4 avenue Napoléon                                       1420
15  38811084           1          4 rue Léon Théodor                                      1090
16  38811084           1          4 avenue de la Basilique                                1081
17  39105054           1          4 EMILE  CLAUS                                          1050
18  39105054           1          4 RUE DU FOYER SCHAERBEEKOIS                            1030
19  39105054           1          4 RUE XAVIER DE BUE                                     1180
20  39105054           1          4 BV LAMBERMONT                                         1030
21  39117031           1          5 KERKSTRAAT                                            3850
22  39117031           1          5 Pater Richard van de Wouwerstraa                      3271
23  39117031           1          5 Wilderenlaan                                          3803
24  39117031           1          5 Gyzevennestraat                                       3560
25  39117031           1          5 Molenveldstraat                                       3500
26  31823918          70          5 Route de l'Etat                                       1380
27  31823918          70          5 Avenue Paul Hymans                                    1200
28  31823918          70          5 Avenue WInston Churchill                              1180
29  31823918          70          5 Avenue Winston Churchill                              1180
30  31823918          70          5 avenue hippocrate                                     1200

That's a simplified version of the data. In the whole database, i have 40 adresses x 15 variables for each (street, number, city, postal code, institution...).

Thank you !


Solution

  • You can use pivot_longer to get data in long format and filter to drop the empty values.

    library(dplyr)
    library(tidyr)
    
    df %>%
      pivot_longer(cols = starts_with('travail'), 
                   names_to = '.value', 
                   names_pattern = 'travail_(.*?)\\d+') %>%
      filter(rue != '')
    
    #  INAMI_key code_qualif n_adresses rue                                code_postal
    #   <chr>     <chr>       <chr>      <chr>                              <chr>      
    # 1 30000120  001         02         "RUE VAN ARTEVELDE               " " 1000"    
    # 2 30000120  001         02         "paul pastur                     " " 6180"    
    # 3 30000417  001         01         "av Margueritr depasse           " " 6060"    
    # 4 37603435  007         01         "du Grand Veneur                 " " 1170"    
    # 5 38300152  007         02         "RUE WAUTERS                92-94" " 6040"    
    # 6 38300152  007         02         "de châtelet                     " " 6120"    
    # 7 38707849  001         03         "de Campine                      " " 4000"    
    # 8 38707849  001         03         "de Chestret                     " " 4000"    
    # 9 38707849  001         03         "de la Gare                      " " 4020"    
    #10 38813856  001         03         "Torhoutste steenweg             " " 8400"    
    # … with 20 more rows