I have the following dataset
> dataset2
ID ATCcode date
1 1 N06AA 2001-01-01
2 1 N06AB 2001-04-01
3 1 N06AB 2001-03-01
4 1 N06AB 2001-02-01
5 1 N06AC 2001-01-01
6 2 N06AA 2001-01-01
7 2 N06AA 2001-02-01
8 2 N06AA 2001-03-01
9 3 N06AB 2001-01-01
10 4 N06AA 2001-02-01
11 4 N06AB 2001-03-01
It's in long format and I'd like it to be in wide format. However, I only want the earliest date for each ATCcode - and not any of the later dates. Thus I'd like to end up here:
> datasetLong
ID N06AA N06AB N06AC
1 1 2001-01-01 2001-02-01 2001-01-01
2 2 2001-01-01 <NA> <NA>
3 3 <NA> 2001-01-01 <NA>
4 4 2001-02-01 2001-03-01 <NA>
(This is just a sample of the real dataset, it has more variation in the values and more observations than this).
I've managed to cast the dataset somewhat, but not in the manner which I want to:
dataset3 <- reshape2::dcast(dataset2, ID ~ ATCcode)
gives me the length of each vector/list:
> dataset3
ID N06AA N06AB N06AC
1 1 1 3 1
2 2 3 0 0
3 3 0 1 0
4 4 1 1 0
Instead of the length, I'd like just one value, and that value should be the smallest value (or, the earliest date).
I've found a similar question asked on StackOverflow, but I was unable to use that in any way without getting various errors. I have not used melt in the above attempt, is that maybe necessary? Any help is appreciated.
This answer uses tidyverse
methods.
One way would be would be to select minimum date from each ID
and ATCcode
and convert the data to wide format.
library(dplyr)
df %>%
mutate(date = as.Date(date)) %>%
group_by(ID, ATCcode) %>%
slice(which.min(date)) %>%
tidyr::pivot_wider(names_from = ATCcode, values_from = date)
# ID N06AA N06AB N06AC
# <int> <date> <date> <date>
#1 1 2001-01-01 2001-02-01 2001-01-01
#2 2 2001-01-01 NA NA
#3 3 NA 2001-01-01 NA
#4 4 2001-02-01 2001-03-01 NA
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L,
4L), ATCcode = structure(c(1L, 2L, 2L, 2L, 3L, 1L, 1L, 1L, 2L,
1L, 2L), .Label = c("N06AA", "N06AB", "N06AC"), class = "factor"),
date = structure(c(1L, 4L, 3L, 2L, 1L, 1L, 2L, 3L, 1L, 2L,
3L), .Label = c("2001-01-01", "2001-02-01", "2001-03-01",
"2001-04-01"), class = "factor")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))