Search code examples
rdataframedata-cleaning

Is there an R function for reshaping data --World Bank


I am trying to reshape data but keep having errors. There are very helpful threads here on reshaping but I have not seen my particular issue. I am trying to go from long to wide - but have both country name and date as the id - typically I have seen solutions for making date the new columns.

#Data from:
library(wbstats)
WorldBank_long <- wb(indicator = c("NY.GDP.PCAP.KD", "SI.POV.GINI", "UNEMPSA_","CPTOTSAXMZGY" ),
                     startdate = 2005, enddate = 2019)
WorldBank <- WorldBank_long[,c("iso3c", "date", "value", "indicatorID")]

Columns in "WorldBank" are "iso3c" & "date" - which are both ID variables. "indicatorID" is the value I want to make new column names and "value" is the value I want for each new column.

I have tried:

library(tidyr)
WorldBank_data_wide<-WorldBank %>% 
  pivot_wider(names_from = indicatorID, values_from = value)

Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 403 rows:

  • 6086, 6101, 6116, 6131, 6146, 6176, 6191, 6206, 6221, 6236, 6251, 6266, 6296, 6431
  • 6085, 6100, 6115, 6130, 6145, 6175, 6190, 6205, 6220, 6235, 6250, 6265, 6295, 6430
  • 6084, 6099, 6114, 6129, 6144, 6174, 6189, 6204, 6219, 6234, 6249, 6264, 6294, 6429
  • 6083, 6098, 6113, 6128, 6143, 6173, 6188, 6203, 6218, 6233, 6248, 6263, 6293, 6428
  • 6082, 6097, 6112, 6127, 6142, 6172, 6187, 6202, 6217, 6232, 6247, 6262, 6292, 6427
  • 6081, 6096, 6111, 6126, 6141, 6171, 6186, 6201, 6216, 6231, 6246, 6261, 6291, 6426
  • 6080, 6095, 6110, 6125, 6140, 6170, 6185, 6200, 6215, 6230, 6245, 6260, 6290, 6425
  • 6079, 6094, 6109, 6124, 6139, 6169, 6184, 6199, 6214, 6229, 6244, 6259, 6289, 6424
  • 6078, 6093, 6108, 6123, 6138, 6168, 6183, 6198, 6213, 6228, 6243, 6258, 6288, 6423
  • 6077, 6092, 6107, 6122, 6137, 6167, 6182, 6197, 6212, 6227, 6242, 6257, 6287, 6422
  • 6076, 6091, 6106, 6121, 6136, 6166, 618
library(data.table)
WorldBank_data_wide = dcast(WorldBank_long, date + iso3c ~indicator, value.var = 'value')

Error: Aggregation function missing: defaulting to length --- The columns were what I wanted BUT instead of the values, the values were 0 or 1.

EDITS: Current Table

iso3c    date        value    indicatorID
1   ARB 2019    6437.167    NY.GDP.PCAP.KD
2   ARB 2018    6465.474    NY.GDP.PCAP.KD
3   ARB 2017    6454.460    NY.GDP.PCAP.KD
4   ARB 2016    6506.271    NY.GDP.PCAP.KD
5   ARB 2015    6418.029    NY.GDP.PCAP.KD
6   ARB 2014    6350.838    NY.GDP.PCAP.KD
7   ARB 2013    6333.027    NY.GDP.PCAP.KD
8   ARB 2012    6274.111    NY.GDP.PCAP.KD
9   ARB 2011    6020.487    NY.GDP.PCAP.KD
10  ARB 2010    5934.291    NY.GDP.PCAP.KD
11  ARB 2009    5806.185    NY.GDP.PCAP.KD

What I want to do -- only instead of 0 and 1 to have the values from above (this is my output from (data.table):

     date iso3c CPTOTSAXMZGY NY.GDP.PCAP.KD SI.POV.GINI UNEMPSA_
1   2005    ABW 0   1   0   0
2   2005    AFG 0   1   0   0
3   2005    AGO 0   1   0   0
4   2005    ALB 0   1   1   0
5   2005    AND 0   1   0   0
6   2005    ARB 0   1   0   0
7   2005    ARE 0   1   0   0
8   2005    ARG 0   1   1   1
9   2005    ARM 0   1   1   1
10  2005    ASM 0   1   0   0
11  2005    ATG 0   1   0   0

EDIT 2

The new command looks like it is better than all other output - but all my values (i.e. GDP per capita or Gini) now are repeated so that all other columns are NA except for the one that is being defined. But scrolling down then GDP becomes NA and Gini has its values.

#using new command:

WorldBank_data_wide<-WorldBank %>%   
 mutate(row = row_number()) %>%   pivot_wider(names_from = indicatorID, values_from = value)

#output
  iso3c date row.  NY.GDP.PCAP.KDSI.POV.GINIUNEMPSA_CPTOTSAXMZGY
2138    BDI 2019    1126    208.0747    NA  NA  NA
2139    BDI 2018    1127    210.8042    NA  NA  NA
2140    BDI 2017    1128    214.1392    NA  NA  NA
2141    BDI 2016    1129    219.9615    NA  NA  NA
2152    BDI 2005    1140    221.0964    NA  NA  NA
2151    BDI 2006    1139    225.6222    NA  NA  NA
2150    BDI 2007    1138    225.8591    NA  NA  NA
2142    BDI 2015    1130    228.4325    NA  NA  NA
2149    BDI 2008    1137    229.1485    NA  NA  NA
2148    BDI 2009    1136    230.1926    NA  NA  NA
2592    ETH 2005    1580    233.9442    NA  NA  NA

Edit:

Another unforeseen issue - Some commands are working but instead of each row being a unique country and year - some are repeated so that there are multiple rows for Argentina in 2013 (for example)

World_bank_wide <- WorldBank %>%
  group_by(iso3c,indicatorID) %>%
  mutate(row_id=1:n()) %>% ungroup() %>%
  spread(indicatorID, value) %>%
  select(-row_id)

row iso3c date CPTOTSAXMZGY NY.GDP.PCAP.KD SI.POV.GINI UNEMPSA_
1   ABW 2005    NA  26979.8854  NA  NA
104 ARE 2011    NA  34634.862   NA  NA
105 ARE 2012    NA  35416.892   NA  NA
106 ARE 2013    NA  36978.833   NA  NA
107 ARE 2014    NA  NA  32.5    NA
108 ARE 2014    NA  38495.046   NA  NA
109 ARE 2015    NA  40247.747   NA  NA
110 ARE 2016    NA  41045.111   NA  NA
111 ARE 2017    NA  41460.283   NA  NA


Solution

  • Based on some of your edits above, I think you may be looking for something like this. First you are creating a variable id, so that each row has a unique value (otherwise, you would return an error while using pivot_wider). Then, you are spreading your data from long to wide using pivot_wider, and then dropping the id variable. Finally, you're only selecting unique row values (dropping duplicates), and then sorting by iso3c and date.

    Edit: Since you need to aggregate across rows, where NA values span your numeric columns, you can summarise where values are not NA to get one row per iso3c per date. Then you can replace your NA numeric values with 0.

    library(wbstats)
    library(tidyverse)
    
    WorldBank_long <- wb(indicator = c("NY.GDP.PCAP.KD", "SI.POV.GINI", "UNEMPSA_","CPTOTSAXMZGY" ),
                         startdate = 2005, enddate = 2019)
    WorldBank <- WorldBank_long[,c("iso3c", "date", "value", "indicatorID")]
    
    
    WorldBank %>% 
      mutate(id = row_number()) %>% 
      pivot_wider(names_from = indicatorID, values_from = value) %>%
      select(-id) %>% 
      distinct() %>% 
      arrange(iso3c, date) %>% 
      group_by(iso3c, date) %>% 
      mutate_all(as.character) %>% 
      summarise(across(.cols = everything(), .fns = ~ .[!is.na(.)][1])) %>% 
      mutate_at(c("NY.GDP.PCAP.KD","SI.POV.GINI", "UNEMPSA_", "CPTOTSAXMZGY"), as.numeric) %>% 
      mutate(across(where(is.numeric), ~replace_na(., 0)))
    

    This gives you a unique value for each iso3c and date, and fills all NA values with 0:

    iso3c date  NY.GDP.PCAP.KD SI.POV.GINI UNEMPSA_ CPTOTSAXMZGY
       <chr> <chr>          <dbl>       <dbl>    <dbl>        <dbl>
     1 ARG   2005           8578.        48      11.6             0
     2 ARG   2006           9175.        46.7    10.2             0
     3 ARG   2007           9902.        46.6     8.45            0
     4 ARG   2008          10201.        45.3     7.86            0
     5 ARG   2009           9502.        44.1     8.67            0
     6 ARG   2010          10386.        44.5     7.75            0
     7 ARG   2011          10883.        42.7     7.15            0
     8 ARG   2012          10650.        41.4     7.21            0
     9 ARG   2013          10785.        41       7.08            0
    10 ARG   2014          10399.        41.7     7.27            0
    11 ARG   2015          10568.         0       6.61            0
    12 ARG   2016          10239.        42       8.47            0
    13 ARG   2017          10404.        41.2     8.35            0
    14 ARG   2018          10044.        41.4     9.24            0
    15 ARG   2019           9729.         0       0               0