Search code examples
rtidyrreshape2

Convert semi-long data into wide data


I'm very sure there should be a simple alternative but I'm not able to figure it out. Currently using a for loop which is not optimal. My dataframe is like this:

NAME <- c("ABC", "ABC", "ABC", "DEF", "GHI", "GHI", "JKL", "JKL", "JKL", "MNO")
YEAR <- c(2012, 2013, 2014, 2012, 2012, 2013, 2012, 2014, 2016, 2013)
MARKS <- c(45, 75, 95, 91, 75, 76, 85, 88, 89, 77)
MAXIMUM <- c(95, NA, NA, 91, 76, NA, 89, NA, NA, 77)

DF <- data.frame(
  NAME,
  YEAR,
  MARKS,
  MAXIMUM
)

> DF
   NAME YEAR MARKS MAXIMUM
1   ABC 2012    45      95
2   ABC 2013    75      NA
3   ABC 2014    95      NA
4   DEF 2012    91      91
5   GHI 2012    75      76
6   GHI 2013    76      NA
7   JKL 2012    85      89
8   JKL 2014    88      NA
9   JKL 2016    89      NA
10  MNO 2013    77      77

I want to have only one name per row and each year-wise details (YEAR, MARKS and MAXIMUM columns) should be spread as individual headers. I have tried to use tidyr::pivot_wider function but was not successful.

I have given the sample output here:

Required output enter image description here


Solution

  • I think all the previous answers have overlooked that the expected output is based on YEAR as a factor. The expected output has 4 grouped-columns per row, not 3. Therefore, you avoid mixing different years in the same column.

    You can assign a number for every row- grp - based on the level of Year as a factor(). Also, if you first pivot longer, you can arrange the values as you want and then pivot wider everything so the columns are sorted as you expect:

    library(tidyverse)
    DF %>% 
      mutate(grp = as.integer(factor(DF$YEAR,unique(DF$YEAR)))) %>% 
      pivot_longer(cols=c('YEAR','MARKS','MAXIMUM'), names_to = 'COLNAMES', values_to= 'COL_VALUES') %>%
      arrange(NAME,grp) %>% 
      pivot_wider(names_from = c(COLNAMES,grp), values_from= COL_VALUES, names_sep = '')
    

    Output:

    # A tibble: 5 x 13
      NAME  YEAR1 MARKS1 MAXIMUM1 YEAR2 MARKS2 MAXIMUM2 YEAR3 MARKS3 MAXIMUM3 YEAR4 MARKS4 MAXIMUM4
      <chr> <dbl>  <dbl>    <dbl> <dbl>  <dbl>    <dbl> <dbl>  <dbl>    <dbl> <dbl>  <dbl>    <dbl>
    1 ABC    2012     45       95  2013     75       NA  2014     95       NA    NA     NA       NA
    2 DEF    2012     91       91    NA     NA       NA    NA     NA       NA    NA     NA       NA
    3 GHI    2012     75       76  2013     76       NA    NA     NA       NA    NA     NA       NA
    4 JKL    2012     85       89    NA     NA       NA  2014     88       NA  2016     89       NA
    5 MNO      NA     NA       NA  2013     77       77    NA     NA       NA    NA     NA       NA
    

    However, I suggest you to keep track of the years to not make the tibble more confusing:

    DF$YEAR = factor(DF$YEAR)
    
    DF %>% 
      pivot_longer(cols=c('MARKS','MAXIMUM'), names_to = 'COLNAMES', values_to= 'COL_VALUES') %>%
      arrange(NAME,YEAR) %>% 
      pivot_wider(names_from = c(COLNAMES,YEAR), values_from= COL_VALUES)
    
    # A tibble: 5 x 9
      NAME  MARKS_2012 MAXIMUM_2012 MARKS_2013 MAXIMUM_2013 MARKS_2014 MAXIMUM_2014 MARKS_2016 MAXIMUM_2016
      <chr>      <dbl>        <dbl>      <dbl>        <dbl>      <dbl>        <dbl>      <dbl>        <dbl>
    1 ABC           45           95         75           NA         95           NA         NA           NA
    2 DEF           91           91         NA           NA         NA           NA         NA           NA
    3 GHI           75           76         76           NA         NA           NA         NA           NA
    4 JKL           85           89         NA           NA         88           NA         89           NA
    5 MNO           NA           NA         77           77         NA           NA         NA           NA