I have a dataset of 6000 villages with variables like population(which I keep constant according to base year). There are three more variables project_1, project_2 and project_3 which gives the details when the project was implemented in this village. This is how the data looks like.
| village | population | project_1 | project_2 | project_3 |
|---------|------------|-----------|-----------|-----------|
| A | 100 | 2002 | | |
| B | 200 | | 2003 | 2002 |
| C | 150 | 2004 | | |
| D | 175 | | | 2005 |
I want to transform this data into long-form(look below). So the basically the project variable becomes a dummy and take a value =1 when the project is implemented and remains equal to 1 thereafter.
| village | population | year | project_1 | project_2 | project_3 |
|---------|------------|------|-----------|-----------|-----------|
| A | 100 | 2001 | 0 | 0 | 0 |
| A | 100 | 2002 | 1 | 0 | 0 |
| A | 100 | 2003 | 1 | 0 | 0 |
| A | 100 | 2004 | 1 | 0 | 0 |
| A | 100 | 2005 | 1 | 0 | 0 |
| B | 200 | 2001 | 0 | 0 | 0 |
| B | 200 | 2002 | 0 | 0 | 1 |
| B | 200 | 2003 | 0 | 1 | 1 |
| B | 200 | 2004 | 0 | 1 | 1 |
| B | 200 | 2005 | 0 | 1 | 1 |
| C | 150 | 2001 | 0 | 0 | 0 |
| C | 150 | 2002 | 0 | 0 | 0 |
| C | 150 | 2003 | 0 | 0 | 0 |
| C | 150 | 2004 | 1 | 0 | 0 |
| C | 150 | 2005 | 1 | 0 | 0 |
| D | 175 | 2001 | 0 | 0 | 0 |
| D | 175 | 2002 | 0 | 0 | 0 |
| D | 175 | 2003 | 0 | 0 | 0 |
| D | 175 | 2004 | 0 | 0 | 0 |
| D | 175 | 2005 | 0 | 0 | 1 |
I have tried this code but so far it's not working.
temp_long <- reshape(data = temp,
idvar= "village",
varying = 3:5,
sep= "",
timevar= "year",
times = c(2001,2002,2003,2004,2005),
new.row.names= 1:100000,
direction = "long")
Suppose your data are:
df <- read_table2("village population project_1 project_2 project_3
A 100 2002 NA NA
B 200 NA 2003 2002
C 150 2004 NA NA
D 175 NA NA 2005")
Using dplyr
:
df %>%
merge(expand.grid(year=2001:2005, village=.$village), by="village") %>%
mutate(across(starts_with("project_"), ~ as.numeric(replace_na(.x <= year, 0)))) %>%
select(village, population, year, starts_with("pro"))
yields
village population year project_1 project_2 project_3
1 A 100 2001 0 0 0
2 A 100 2002 1 0 0
3 A 100 2003 1 0 0
4 A 100 2004 1 0 0
5 A 100 2005 1 0 0
6 B 200 2001 0 0 0
7 B 200 2002 0 0 1
8 B 200 2003 0 1 1
9 B 200 2004 0 1 1
10 B 200 2005 0 1 1
11 C 150 2001 0 0 0
12 C 150 2002 0 0 0
13 C 150 2003 0 0 0
14 C 150 2004 1 0 0
15 C 150 2005 1 0 0
16 D 175 2001 0 0 0
17 D 175 2002 0 0 0
18 D 175 2003 0 0 0
19 D 175 2004 0 0 0
20 D 175 2005 0 0 1
With your dput
data
df2 <- structure(list(key = c("057091", "057296", "057802", "057806", "058105", "058309"), TOT_POP = c(795, 378, 669, 3760, 55, 933 ), road_comp_date_upg_year_final = c(2009, 2004, 2006, 2006, 2008, 2012), road_award_date_upg_year_final = c(2008, 2003, 2005, 2005, 2007, 2010), road_comp_date_stip_upg_year_final = c(2009, 2003, 2006, 2006, 2008, 2011)), row.names = c(NA, 6L), class = "data.frame")
and the adjusted code
df2 %>%
merge(expand.grid(year=2001:2015, key=.$key), by="key") %>%
mutate(across(starts_with("road_"), ~ as.numeric(replace_na(.x <= year, 0)))) %>%
select(key, TOT_POP, year, starts_with("road"))
creates
key TOT_POP year road_comp_date_upg_year_final road_award_date_upg_year_final road_comp_date_stip_upg_year_final
1 057091 795 2001 0 0 0
2 057091 795 2002 0 0 0
3 057091 795 2003 0 0 0
4 057091 795 2004 0 0 0
5 057091 795 2005 0 0 0
6 057091 795 2006 0 0 0
7 057091 795 2007 0 0 0
8 057091 795 2008 0 1 0
9 057091 795 2009 1 1 1
10 057091 795 2010 1 1 1
11 057091 795 2011 1 1 1
12 057091 795 2012 1 1 1
13 057091 795 2013 1 1 1
14 057091 795 2014 1 1 1
15 057091 795 2015 1 1 1
16 057296 378 2001 0 0 0
17 057296 378 2002 0 0 0
18 057296 378 2003 0 1 1
19 057296 378 2004 1 1 1
20 057296 378 2005 1 1 1