Search code examples
rreshapereshape2dcast

Using dcast to cast a wide format when the variables per time period are not uniquely specified


I would like to convert this csv file into a long format. It currently look as follows:

enter image description here

Every ID is listed 1 time for each years, so a total of 7 times.

What I would like to do is to for each ID have one row, where the variables are listed as logwks1 + logwks2 + .. + logwks6 + logwks7.

I have started by simply melting with wagem <- melt(wage, id = "ID")

I however don't understand how to cast it in order to have the desired format.

I tried to do wagec <- dcast(wagem, ID ~ variable), but then it counts the observations as the default (I think because it has no way to know how to cast them otherwise).

How do I fix this?


Solution

  • spread from tidyr can do this for you. Just change the values of the year column first to match the column names you want later.

    library(tidyverse)
    data <- tibble::tribble(
              ~Year,       ~LOGWKS, ~ID,
                  1, "0,862124465",   1,
                  2, "0,433704181",   1,
                  3, "0,409959143",   1,
                  4, "0,763847693",   1,
                  5, "0,847479032",   1,
                  6, "0,855926486",   1,
                  7, "0,809774126",   1
              )
    data %>% 
      mutate(
        Year = paste0("LOGWKS", Year)
      ) %>% 
      spread(
        Year, LOGWKS
      )
    #> # A tibble: 1 x 8
    #>      ID LOGWKS1   LOGWKS2   LOGWKS3   LOGWKS4   LOGWKS5   LOGWKS6  LOGWKS7 
    #>   <dbl> <chr>     <chr>     <chr>     <chr>     <chr>     <chr>    <chr>   
    #> 1     1 0,862124~ 0,433704~ 0,409959~ 0,763847~ 0,847479~ 0,85592~ 0,80977~
    

    Created on 2019-08-09 by the reprex package (v0.3.0)

    edit: if you have multiple variables to spread you can use gather first and then cast it:

    library(tidyverse)
    data_semi_long <- tibble::tribble(
      ~Year,       ~LOGWKS,     ~whatever, ~ID,
          1, "0,402711636", "0,182708713",   1,
          2, "0,094020099", "0,776126975",   1,
          3, "0,948184845", "0,083343821",   1,
          4, "0,529592883", "0,462755147",   1,
          5, "0,612587798", "0,613195331",   1,
          6, "0,108845887", "0,032397081",   1,
          7, "0,585433903", "0,788149493",   1
      )
    data_semi_long %>% 
      gather(key, value, -ID, - Year) %>% 
      mutate(
        Year = paste0(key, Year)
      ) %>% 
      reshape2::dcast(
        ID ~Year
      )
    #>   ID     LOGWKS1     LOGWKS2     LOGWKS3     LOGWKS4     LOGWKS5
    #> 1  1 0,402711636 0,094020099 0,948184845 0,529592883 0,612587798
    #>       LOGWKS6     LOGWKS7   whatever1   whatever2   whatever3   whatever4
    #> 1 0,108845887 0,585433903 0,182708713 0,776126975 0,083343821 0,462755147
    #>     whatever5   whatever6   whatever7
    #> 1 0,613195331 0,032397081 0,788149493
    

    Created on 2019-08-09 by the reprex package (v0.3.0)