Search code examples
rdatereshape2dcast

How to use reshape2's dcast to select only one of several observations(values)


I have the following dataset

> dataset2
   ID ATCcode       date
1   1   N06AA 2001-01-01
2   1   N06AB 2001-04-01
3   1   N06AB 2001-03-01
4   1   N06AB 2001-02-01
5   1   N06AC 2001-01-01
6   2   N06AA 2001-01-01
7   2   N06AA 2001-02-01
8   2   N06AA 2001-03-01
9   3   N06AB 2001-01-01
10  4   N06AA 2001-02-01
11  4   N06AB 2001-03-01

It's in long format and I'd like it to be in wide format. However, I only want the earliest date for each ATCcode - and not any of the later dates. Thus I'd like to end up here:

> datasetLong
  ID      N06AA      N06AB      N06AC
1  1 2001-01-01 2001-02-01 2001-01-01
2  2 2001-01-01       <NA>       <NA>
3  3       <NA> 2001-01-01       <NA>
4  4 2001-02-01 2001-03-01       <NA>

(This is just a sample of the real dataset, it has more variation in the values and more observations than this).

I've managed to cast the dataset somewhat, but not in the manner which I want to:

dataset3 <- reshape2::dcast(dataset2, ID ~ ATCcode) 

gives me the length of each vector/list:

> dataset3
  ID N06AA N06AB N06AC
1  1     1     3     1
2  2     3     0     0
3  3     0     1     0
4  4     1     1     0

Instead of the length, I'd like just one value, and that value should be the smallest value (or, the earliest date).

I've found a similar question asked on StackOverflow, but I was unable to use that in any way without getting various errors. I have not used melt in the above attempt, is that maybe necessary? Any help is appreciated.


Solution

  • This answer uses tidyverse methods.

    One way would be would be to select minimum date from each ID and ATCcode and convert the data to wide format.

    library(dplyr)
    
    df %>%
      mutate(date = as.Date(date)) %>%
      group_by(ID, ATCcode) %>%
      slice(which.min(date)) %>%
      tidyr::pivot_wider(names_from = ATCcode, values_from = date)
    
    #     ID N06AA      N06AB      N06AC     
    #  <int> <date>     <date>     <date>    
    #1     1 2001-01-01 2001-02-01 2001-01-01
    #2     2 2001-01-01 NA         NA        
    #3     3 NA         2001-01-01 NA        
    #4     4 2001-02-01 2001-03-01 NA        
    

    data

    df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 
    4L), ATCcode = structure(c(1L, 2L, 2L, 2L, 3L, 1L, 1L, 1L, 2L, 
    1L, 2L), .Label = c("N06AA", "N06AB", "N06AC"), class = "factor"), 
    date = structure(c(1L, 4L, 3L, 2L, 1L, 1L, 2L, 3L, 1L, 2L, 
    3L), .Label = c("2001-01-01", "2001-02-01", "2001-03-01", 
    "2001-04-01"), class = "factor")), class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))