I am trying to reshape data but keep having errors. There are very helpful threads here on reshaping but I have not seen my particular issue. I am trying to go from long to wide - but have both country name and date as the id - typically I have seen solutions for making date the new columns.
#Data from:
library(wbstats)
WorldBank_long <- wb(indicator = c("NY.GDP.PCAP.KD", "SI.POV.GINI", "UNEMPSA_","CPTOTSAXMZGY" ),
startdate = 2005, enddate = 2019)
WorldBank <- WorldBank_long[,c("iso3c", "date", "value", "indicatorID")]
Columns in "WorldBank" are "iso3c" & "date" - which are both ID variables. "indicatorID" is the value I want to make new column names and "value" is the value I want for each new column.
I have tried:
library(tidyr)
WorldBank_data_wide<-WorldBank %>%
pivot_wider(names_from = indicatorID, values_from = value)
Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 403 rows:
library(data.table)
WorldBank_data_wide = dcast(WorldBank_long, date + iso3c ~indicator, value.var = 'value')
Error: Aggregation function missing: defaulting to length --- The columns were what I wanted BUT instead of the values, the values were 0 or 1.
EDITS: Current Table
iso3c date value indicatorID
1 ARB 2019 6437.167 NY.GDP.PCAP.KD
2 ARB 2018 6465.474 NY.GDP.PCAP.KD
3 ARB 2017 6454.460 NY.GDP.PCAP.KD
4 ARB 2016 6506.271 NY.GDP.PCAP.KD
5 ARB 2015 6418.029 NY.GDP.PCAP.KD
6 ARB 2014 6350.838 NY.GDP.PCAP.KD
7 ARB 2013 6333.027 NY.GDP.PCAP.KD
8 ARB 2012 6274.111 NY.GDP.PCAP.KD
9 ARB 2011 6020.487 NY.GDP.PCAP.KD
10 ARB 2010 5934.291 NY.GDP.PCAP.KD
11 ARB 2009 5806.185 NY.GDP.PCAP.KD
What I want to do -- only instead of 0 and 1 to have the values from above (this is my output from (data.table):
date iso3c CPTOTSAXMZGY NY.GDP.PCAP.KD SI.POV.GINI UNEMPSA_
1 2005 ABW 0 1 0 0
2 2005 AFG 0 1 0 0
3 2005 AGO 0 1 0 0
4 2005 ALB 0 1 1 0
5 2005 AND 0 1 0 0
6 2005 ARB 0 1 0 0
7 2005 ARE 0 1 0 0
8 2005 ARG 0 1 1 1
9 2005 ARM 0 1 1 1
10 2005 ASM 0 1 0 0
11 2005 ATG 0 1 0 0
EDIT 2
The new command looks like it is better than all other output - but all my values (i.e. GDP per capita or Gini) now are repeated so that all other columns are NA except for the one that is being defined. But scrolling down then GDP becomes NA and Gini has its values.
#using new command:
WorldBank_data_wide<-WorldBank %>%
mutate(row = row_number()) %>% pivot_wider(names_from = indicatorID, values_from = value)
#output
iso3c date row. NY.GDP.PCAP.KDSI.POV.GINIUNEMPSA_CPTOTSAXMZGY
2138 BDI 2019 1126 208.0747 NA NA NA
2139 BDI 2018 1127 210.8042 NA NA NA
2140 BDI 2017 1128 214.1392 NA NA NA
2141 BDI 2016 1129 219.9615 NA NA NA
2152 BDI 2005 1140 221.0964 NA NA NA
2151 BDI 2006 1139 225.6222 NA NA NA
2150 BDI 2007 1138 225.8591 NA NA NA
2142 BDI 2015 1130 228.4325 NA NA NA
2149 BDI 2008 1137 229.1485 NA NA NA
2148 BDI 2009 1136 230.1926 NA NA NA
2592 ETH 2005 1580 233.9442 NA NA NA
Edit:
Another unforeseen issue - Some commands are working but instead of each row being a unique country and year - some are repeated so that there are multiple rows for Argentina in 2013 (for example)
World_bank_wide <- WorldBank %>%
group_by(iso3c,indicatorID) %>%
mutate(row_id=1:n()) %>% ungroup() %>%
spread(indicatorID, value) %>%
select(-row_id)
row iso3c date CPTOTSAXMZGY NY.GDP.PCAP.KD SI.POV.GINI UNEMPSA_
1 ABW 2005 NA 26979.8854 NA NA
104 ARE 2011 NA 34634.862 NA NA
105 ARE 2012 NA 35416.892 NA NA
106 ARE 2013 NA 36978.833 NA NA
107 ARE 2014 NA NA 32.5 NA
108 ARE 2014 NA 38495.046 NA NA
109 ARE 2015 NA 40247.747 NA NA
110 ARE 2016 NA 41045.111 NA NA
111 ARE 2017 NA 41460.283 NA NA
Based on some of your edits above, I think you may be looking for something like this. First you are creating a variable id
, so that each row has a unique value (otherwise, you would return an error while using pivot_wider
). Then, you are spreading your data from long to wide using pivot_wider
, and then dropping the id
variable. Finally, you're only selecting unique row values (dropping duplicates), and then sorting by iso3c
and date
.
Edit: Since you need to aggregate across rows, where NA
values span your numeric columns, you can summarise
where values are not NA
to get one row per iso3c
per date
. Then you can replace your NA
numeric values with 0.
library(wbstats)
library(tidyverse)
WorldBank_long <- wb(indicator = c("NY.GDP.PCAP.KD", "SI.POV.GINI", "UNEMPSA_","CPTOTSAXMZGY" ),
startdate = 2005, enddate = 2019)
WorldBank <- WorldBank_long[,c("iso3c", "date", "value", "indicatorID")]
WorldBank %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = indicatorID, values_from = value) %>%
select(-id) %>%
distinct() %>%
arrange(iso3c, date) %>%
group_by(iso3c, date) %>%
mutate_all(as.character) %>%
summarise(across(.cols = everything(), .fns = ~ .[!is.na(.)][1])) %>%
mutate_at(c("NY.GDP.PCAP.KD","SI.POV.GINI", "UNEMPSA_", "CPTOTSAXMZGY"), as.numeric) %>%
mutate(across(where(is.numeric), ~replace_na(., 0)))
This gives you a unique value for each iso3c
and date
, and fills all NA
values with 0:
iso3c date NY.GDP.PCAP.KD SI.POV.GINI UNEMPSA_ CPTOTSAXMZGY
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 ARG 2005 8578. 48 11.6 0
2 ARG 2006 9175. 46.7 10.2 0
3 ARG 2007 9902. 46.6 8.45 0
4 ARG 2008 10201. 45.3 7.86 0
5 ARG 2009 9502. 44.1 8.67 0
6 ARG 2010 10386. 44.5 7.75 0
7 ARG 2011 10883. 42.7 7.15 0
8 ARG 2012 10650. 41.4 7.21 0
9 ARG 2013 10785. 41 7.08 0
10 ARG 2014 10399. 41.7 7.27 0
11 ARG 2015 10568. 0 6.61 0
12 ARG 2016 10239. 42 8.47 0
13 ARG 2017 10404. 41.2 8.35 0
14 ARG 2018 10044. 41.4 9.24 0
15 ARG 2019 9729. 0 0 0