Search code examples
rpanel

Transfprm wide data into long by mutationg some variable and keeping some variable as same as before


I have a dataset of 6000 villages with variables like population(which I keep constant according to base year). There are three more variables project_1, project_2 and project_3 which gives the details when the project was implemented in this village. This is how the data looks like.

| village | population | project_1 | project_2 | project_3 |
|---------|------------|-----------|-----------|-----------|
| A       | 100        | 2002      |           |           |
| B       | 200        |           | 2003      | 2002      |
| C       | 150        | 2004      |           |           |
| D       | 175        |           |           | 2005      |

I want to transform this data into long-form(look below). So the basically the project variable becomes a dummy and take a value =1 when the project is implemented and remains equal to 1 thereafter.

| village | population | year | project_1 | project_2 | project_3 |
|---------|------------|------|-----------|-----------|-----------|
| A       | 100        | 2001 | 0         | 0         | 0         |
| A       | 100        | 2002 | 1         | 0         | 0         |
| A       | 100        | 2003 | 1         | 0         | 0         |
| A       | 100        | 2004 | 1         | 0         | 0         |
| A       | 100        | 2005 | 1         | 0         | 0         |
| B       | 200        | 2001 | 0         | 0         | 0         |
| B       | 200        | 2002 | 0         | 0         | 1         |
| B       | 200        | 2003 | 0         | 1         | 1         |
| B       | 200        | 2004 | 0         | 1         | 1         |
| B       | 200        | 2005 | 0         | 1         | 1         |
| C       | 150        | 2001 | 0         | 0         | 0         |
| C       | 150        | 2002 | 0         | 0         | 0         |
| C       | 150        | 2003 | 0         | 0         | 0         |
| C       | 150        | 2004 | 1         | 0         | 0         |
| C       | 150        | 2005 | 1         | 0         | 0         |
| D       | 175        | 2001 | 0         | 0         | 0         |
| D       | 175        | 2002 | 0         | 0         | 0         |
| D       | 175        | 2003 | 0         | 0         | 0         |
| D       | 175        | 2004 | 0         | 0         | 0         |
| D       | 175        | 2005 | 0         | 0         | 1         |

I have tried this code but so far it's not working.

temp_long <- reshape(data = temp,
             idvar= "village",
             varying = 3:5,
             sep= "",
             timevar= "year",
             times = c(2001,2002,2003,2004,2005),
             new.row.names= 1:100000,
             direction = "long")


Solution

  • Suppose your data are:

    df <- read_table2("village population project_1 project_2 project_3
    A 100 2002 NA NA
    B 200 NA 2003 2002
    C 150 2004 NA NA
    D 175 NA NA 2005")
    

    Using dplyr:

    df %>%
      merge(expand.grid(year=2001:2005, village=.$village), by="village") %>%
      mutate(across(starts_with("project_"), ~ as.numeric(replace_na(.x <= year, 0)))) %>%
      select(village, population, year, starts_with("pro"))
    

    yields

       village population year project_1 project_2 project_3
    1        A        100 2001         0         0         0
    2        A        100 2002         1         0         0
    3        A        100 2003         1         0         0
    4        A        100 2004         1         0         0
    5        A        100 2005         1         0         0
    6        B        200 2001         0         0         0
    7        B        200 2002         0         0         1
    8        B        200 2003         0         1         1
    9        B        200 2004         0         1         1
    10       B        200 2005         0         1         1
    11       C        150 2001         0         0         0
    12       C        150 2002         0         0         0
    13       C        150 2003         0         0         0
    14       C        150 2004         1         0         0
    15       C        150 2005         1         0         0
    16       D        175 2001         0         0         0
    17       D        175 2002         0         0         0
    18       D        175 2003         0         0         0
    19       D        175 2004         0         0         0
    20       D        175 2005         0         0         1
    

    With your dput data

    df2 <- structure(list(key = c("057091", "057296", "057802", "057806",  "058105", "058309"), TOT_POP = c(795, 378, 669, 3760, 55, 933 ), road_comp_date_upg_year_final = c(2009, 2004, 2006, 2006,  2008, 2012), road_award_date_upg_year_final = c(2008, 2003, 2005,  2005, 2007, 2010), road_comp_date_stip_upg_year_final = c(2009,  2003, 2006, 2006, 2008, 2011)), row.names = c(NA, 6L), class = "data.frame")
    

    and the adjusted code

    df2 %>%
      merge(expand.grid(year=2001:2015, key=.$key), by="key") %>% 
      mutate(across(starts_with("road_"), ~ as.numeric(replace_na(.x <= year, 0)))) %>%
      select(key, TOT_POP, year, starts_with("road"))
    

    creates

          key TOT_POP year road_comp_date_upg_year_final road_award_date_upg_year_final road_comp_date_stip_upg_year_final
    1  057091     795 2001                             0                              0                                  0
    2  057091     795 2002                             0                              0                                  0
    3  057091     795 2003                             0                              0                                  0
    4  057091     795 2004                             0                              0                                  0
    5  057091     795 2005                             0                              0                                  0
    6  057091     795 2006                             0                              0                                  0
    7  057091     795 2007                             0                              0                                  0
    8  057091     795 2008                             0                              1                                  0
    9  057091     795 2009                             1                              1                                  1
    10 057091     795 2010                             1                              1                                  1
    11 057091     795 2011                             1                              1                                  1
    12 057091     795 2012                             1                              1                                  1
    13 057091     795 2013                             1                              1                                  1
    14 057091     795 2014                             1                              1                                  1
    15 057091     795 2015                             1                              1                                  1
    16 057296     378 2001                             0                              0                                  0
    17 057296     378 2002                             0                              0                                  0
    18 057296     378 2003                             0                              1                                  1
    19 057296     378 2004                             1                              1                                  1
    20 057296     378 2005                             1                              1                                  1