I know that I am doing something wrong, but I cannot pinpoint what exactly.
I have some data in a long format, consisting of Pseudo_ID, UpdateDateO, and UpdateTimeO. Time is clustered within date, and date within ID. I would like to assign unique values (index) to UpdateDateO (dependent on ID), and UpdateTimeO (dependent on date, and ID). Here's some data:
Pseudo_ID UpdateDateO UpdateTimeO
1 2-6-2012 95740000
1 2-6-2012 95740001
1 2-6-2012 95740002
1 3-6-2012 182642000
1 3-6-2012 182642001
1 4-6-2012 182642002
1 4-6-2012 182643000
2 20-5-2012 34040000
2 20-5-2012 101944000
2 20-5-2012 101944001
2 22-5-2012 101944002
2 22-5-2012 190936000
2 23-5-2012 190936001
2 23-5-2012 190936002
3 18-9-2013 20836000
3 18-9-2013 20836001
3 18-9-2013 20836002
3 19-9-2013 20836003
3 19-9-2013 132209000
3 19-9-2013 132209001
3 20-9-2013 114636000
3 21-9-2013 114636001
4 4-4-2014 101347000
4 4-4-2014 101347001
4 5-4-2014 101347002
4 5-4-2014 101347003
4 5-4-2014 101347004
4 5-4-2014 192216000
4 5-4-2014 192216001
4 6-4-2014 192216002
4 6-4-2014 192216003
4 6-4-2014 192216004
4 7-4-2014 180337000
4 7-4-2014 180337001
4 7-4-2014 180337002
5 1-12-2016 111830000
5 1-12-2016 111830001
5 1-12-2016 111830002
5 1-12-2016 111830003
5 1-12-2016 192131000
5 2-12-2016 192131001
5 3-12-2016 192131002
5 3-12-2016 111831000
5 3-12-2016 111831001
5 3-12-2016 111831002
I decided to create a for loop:
for (i in unique(data$Pseudo_ID)) data$Day[data$Pseudo_ID == i] <- match(data$UpdateDateO, unique(data$UpdateDateO))
for (i in unique(data$Day)) data$Time[data$Day == i] <- match(data$UpdateTimeO, unique(data$UpdateTimeO))
First, it gives me a warning: number of items to replace is not a multiple of replacement length. Second, it generates the wrong output (incorrect values for both matching and differing days/times). I also want the day index to start at 1 for every new participant, and the time index to start at 1 within each day. I know that I am not using the for loop properly, but at this point I am stuck.
This is the output:
Pseudo_ID UpdateDateO UpdateTimeO Day Time
1 2-6-2012 95740000 1 1
1 2-6-2012 95740001 1 2
1 2-6-2012 95740002 1 3
1 3-6-2012 182642000 2 1
1 3-6-2012 182642001 2 2
1 4-6-2012 182642002 3 1
1 4-6-2012 182643000 3 2
2 20-5-2012 34040000 1 4
2 20-5-2012 101944000 1 5
2 20-5-2012 101944001 1 6
2 22-5-2012 101944002 2 3
2 22-5-2012 190936000 2 4
2 23-5-2012 190936001 3 3
2 23-5-2012 190936002 3 4
3 18-9-2013 20836000 1 7
3 18-9-2013 20836001 1 8
3 18-9-2013 20836002 1 9
3 19-9-2013 20836003 2 5
3 19-9-2013 132209000 2 6
3 19-9-2013 132209001 3 5
3 20-9-2013 114636000 3 6
3 21-9-2013 114636001 4 1
4 4-4-2014 101347000 1 10
4 4-4-2014 101347001 1 11
4 5-4-2014 101347002 1 12
4 5-4-2014 101347003 2 7
4 5-4-2014 101347004 2 8
4 5-4-2014 192216000 3 7
4 5-4-2014 192216001 3 8
4 6-4-2014 192216002 4 2
4 6-4-2014 192216003 4 3
4 6-4-2014 192216004 4 4
4 7-4-2014 180337000 5 1
4 7-4-2014 180337001 5 2
4 7-4-2014 180337002 6 1
5 1-12-2016 111830000 1 13
5 1-12-2016 111830001 1 14
5 1-12-2016 111830002 1 15
5 1-12-2016 111830003 2 9
5 1-12-2016 192131000 2 10
5 2-12-2016 192131001 3 9
5 3-12-2016 192131002 3 10
5 3-12-2016 111831000 4 5
5 3-12-2016 111831001 4 6
5 3-12-2016 111831002 4 7
Whereas I would like to see something like this:
Pseudo_ID UpdateDateO UpdateTimeO Day Time
1 2-6-2012 95740000 1 1
1 2-6-2012 95740001 1 2
1 2-6-2012 95740002 1 3
1 2-6-2012 95740002 1 3
1 3-6-2012 182642000 2 1
1 3-6-2012 182642001 2 2
1 4-6-2012 182642002 3 1
1 4-6-2012 182643000 3 2
2 20-5-2012 34040000 1 1
2 20-5-2012 101944000 1 2
2 20-5-2012 101944001 1 3
2 22-5-2012 101944002 2 1
2 22-5-2012 101944002 2 1
2 22-5-2012 101944002 2 1
2 22-5-2012 190936000 2 2
2 23-5-2012 190936001 3 1
2 23-5-2012 190936002 3 2
3 18-9-2013 20836000 1 1
3 18-9-2013 20836001 1 2
3 18-9-2013 20836001 1 2
3 18-9-2013 20836001 1 2
3 18-9-2013 20836002 1 3
3 19-9-2013 20836003 2 1
3 19-9-2013 132209000 2 2
3 19-9-2013 132209000 2 2
3 19-9-2013 132209000 2 2
3 19-9-2013 132209001 2 3
3 20-9-2013 114636000 3 1
3 21-9-2013 114636001 4 1
A solution using dplyr and the function rleid from data.table, which makes it easier to create the grouping variables:
df = read.table(text = "
Pseudo_ID UpdateDateO UpdateTimeO
1 2-6-2012 95740000
1 2-6-2012 95740001
1 2-6-2012 95740002
1 2-6-2012 95740002
1 3-6-2012 182642000
1 3-6-2012 182642001
1 4-6-2012 182642002
1 4-6-2012 182643000
2 20-5-2012 34040000
2 20-5-2012 101944000
2 20-5-2012 101944001
2 22-5-2012 101944002
2 22-5-2012 101944002
2 22-5-2012 101944002
2 22-5-2012 190936000
2 23-5-2012 190936001
2 23-5-2012 190936002
", header = TRUE)
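library(dplyr)
library(data.table)  # for rleid()

df %>%
  group_by(Pseudo_ID) %>%                 # restart the day counter for each ID
  mutate(Day = rleid(UpdateDateO)) %>%    # new index whenever the date changes
  group_by(Pseudo_ID, UpdateDateO) %>%    # restart the time counter within each day
  mutate(Time = rleid(UpdateTimeO)) %>%   # new index whenever the time changes
  ungroup()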
# # A tibble: 17 x 5
# Pseudo_ID UpdateDateO UpdateTimeO Day Time
# <int> <fct> <int> <int> <int>
# 1 1 2-6-2012 95740000 1 1
# 2 1 2-6-2012 95740001 1 2
# 3 1 2-6-2012 95740002 1 3
# 4 1 2-6-2012 95740002 1 3
# 5 1 3-6-2012 182642000 2 1
# 6 1 3-6-2012 182642001 2 2
# 7 1 4-6-2012 182642002 3 1
# 8 1 4-6-2012 182643000 3 2
# 9 2 20-5-2012 34040000 1 1
#10 2 20-5-2012 101944000 1 2
#11 2 20-5-2012 101944001 1 3
#12 2 22-5-2012 101944002 2 1
#13 2 22-5-2012 101944002 2 1
#14 2 22-5-2012 101944002 2 1
#15 2 22-5-2012 190936000 2 2
#16 2 23-5-2012 190936001 3 1
#17 2 23-5-2012 190936002 3 2
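For comparison, here is a minimal base R sketch of the same idea, kept close to the match(x, unique(x)) approach from the question. The warning in the original loop arises because the right-hand side is computed on the whole column while the left-hand side selects only the rows of one group, so the lengths differ. Applying match() within each group avoids that. The helper name first_seen is made up for this illustration, and the data is a shortened copy of the rows above.

```r
# Shortened example data (a subset of the rows used above).
df <- read.table(text = "
Pseudo_ID UpdateDateO UpdateTimeO
1 2-6-2012 95740000
1 2-6-2012 95740001
1 2-6-2012 95740002
1 2-6-2012 95740002
1 3-6-2012 182642000
1 3-6-2012 182642001
2 20-5-2012 34040000
2 22-5-2012 101944002
2 22-5-2012 101944002
2 22-5-2012 190936000
", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: index of first appearance
# (1 for the first distinct value, 2 for the next distinct value, ...).
first_seen <- function(x) match(x, unique(x))

# Day: computed per Pseudo_ID. ave() is given row numbers rather than the
# dates themselves so that the result stays an integer vector.
df$Day <- ave(seq_len(nrow(df)), df$Pseudo_ID,
              FUN = function(i) first_seen(df$UpdateDateO[i]))

# Time: computed per Pseudo_ID and date.
df$Time <- ave(df$UpdateTimeO, df$Pseudo_ID, df$UpdateDateO,
               FUN = first_seen)

df
```

One difference worth knowing: rleid starts a new index every time the value changes from one row to the next, so it relies on the rows being ordered by ID, date, and time. The match/unique version gives a repeated value the same index even if it reappears later within the group. On data sorted like the example, both give the same result.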