Search code examples
rdatabasedataframedata-cleaning

Unravel Data into a more friendly DataFrame


Sorry if this has been asked before, couldn't find the forum cause I'm not even sure how to look this up. But this is my problem, I have this dataframe in R:

Area        Item                                                              Year      Unit  Value
   <chr>       <chr>                                                             <chr>     <chr> <chr>
 1 Afghanistan Average dietary energy supply adequacy (percent) (3-year average) 2000-2002 %     87   
 2 Afghanistan Average dietary energy supply adequacy (percent) (3-year average) 2001-2003 %     88   
 3 Afghanistan Average dietary energy supply adequacy (percent) (3-year average) 2002-2004 %     91   
 4 Afghanistan Average dietary energy supply adequacy (percent) (3-year average) 2003-2005 %     92   
 5 Afghanistan Average dietary energy supply adequacy (percent) (3-year average) 2004-2006 %     92   
 6 Afghanistan Average dietary energy supply adequacy (percent) (3-year average) 2005-2007 %     94   
 7 Afghanistan Average dietary energy supply adequacy (percent) (3-year average) 2006-2008 %     95   
 8 Afghanistan Average dietary energy supply adequacy (percent) (3-year average) 2007-2009 %     97   
 9 Afghanistan Average dietary energy supply adequacy (percent) (3-year average) 2008-2010 %     100  
10 Afghanistan Average dietary energy supply adequacy (percent) (3-year average) 2009-2011 %     102

There are 51 items in the data frame, but some areas and some years don't have certain items. I would like an outcome as follows to be able to play with Correlation matrixes, heatmaps, Data Viz, etc., but I don't know how:

Area         Year       Item1 Item2 ... Item52
Afghanistan  2000-2002  87    NA    ... NA
Afghanistan  2001-2002  NA*    88    ... NA  

*It could happen that there is an Item1 value for 2001-2002 in Afghanistan, but for the example I didn't put it.

Where Item-i is the name of the 51 different items and the dataframe is filled with NAs where the value for that Item in that area and year isn't meassured.

Thanks!


Solution

  • Given your explanation i assume the data is sorted i.e. 51 elements with possible missings as NA in order.

    df<-data.frame(Area=c(rep("Afghanistan", 51*12),
                         rep("Pakistan", 51*12)), 
                  Item=paste("Average dieatary item", rep(1:51, each=12)), 
                  Year = rep(paste(2000:2011, 2002:2013, sep="-"), 51),
                  Value = c(87,88,91,92,92,94,95,97,100,102,200,300, sample(100, 51*2*12-12,T)))
    
    result <-do.call(rbind, by(df, list(df$Year, df$Area), function(x) {
      data <- data.frame(Area = unique(x$Area), Year = unique(x$Year), t(x$Value))
      colnames(data)[3:53] = paste("Item",1:51)
      data
    }))
    
    print(head(result[c(1,2,3:5,50:51)]))
    #>          Area      Year Item 1 Item 2 Item 3 Item 48 Item 49
    #> 1 Afghanistan 2000-2002     87     91     50      52      10
    #> 2 Afghanistan 2001-2003     88     20     91      46      67
    #> 3 Afghanistan 2002-2004     91     30     15      88      83
    #> 4 Afghanistan 2003-2005     92     74     21      29      17
    #> 5 Afghanistan 2004-2006     92     87     65      71      66
    #> 6 Afghanistan 2005-2007     94     58     41      46      49
    

    other way with dplyr/tidyr

    library(dplyr)
    library(tidyr)
    
    result2 <- df %>% group_by(Year, Area) %>% mutate(id= 1:n()) %>% 
      select(Area, id, Year, Value) %>% 
      pivot_wider(c(Area, Year), names_from = id, names_prefix = "Item", values_from = Value)
    
    print(head(result2[c(1,2,3:5,50:51)]))
    #> # A tibble: 6 × 7
    #> # Groups:   Year, Area [6]
    #>   Area        Year      Item1 Item2 Item3 Item48 Item49
    #>   <chr>       <chr>     <dbl> <dbl> <dbl>  <dbl>  <dbl>
    #> 1 Afghanistan 2000-2002    87    91    50     52     10
    #> 2 Afghanistan 2001-2003    88    20    91     46     67
    #> 3 Afghanistan 2002-2004    91    30    15     88     83
    #> 4 Afghanistan 2003-2005    92    74    21     29     17
    #> 5 Afghanistan 2004-2006    92    87    65     71     66
    #> 6 Afghanistan 2005-2007    94    58    41     46     49
    

    Created on 2022-11-07 with reprex v2.0.2