Unique values for clustered factors


I know that I am doing something wrong, but I cannot pinpoint what exactly.

I have some data in long format, consisting of Pseudo_ID, UpdateDateO, and UpdateTimeO. Time is clustered within date, and date within ID. I would like to assign an index to UpdateDateO that restarts within each ID, and an index to UpdateTimeO that restarts within each date and ID. Here's some data:

Pseudo_ID   UpdateDateO UpdateTimeO
1   2-6-2012    95740000
1   2-6-2012    95740001
1   2-6-2012    95740002
1   3-6-2012    182642000
1   3-6-2012    182642001
1   4-6-2012    182642002
1   4-6-2012    182643000
2   20-5-2012   34040000
2   20-5-2012   101944000
2   20-5-2012   101944001
2   22-5-2012   101944002
2   22-5-2012   190936000
2   23-5-2012   190936001
2   23-5-2012   190936002
3   18-9-2013   20836000
3   18-9-2013   20836001
3   18-9-2013   20836002
3   19-9-2013   20836003
3   19-9-2013   132209000 
3   19-9-2013   132209001
3   20-9-2013   114636000
3   21-9-2013   114636001
4   4-4-2014    101347000
4   4-4-2014    101347001
4   5-4-2014    101347002
4   5-4-2014    101347003
4   5-4-2014    101347004
4   5-4-2014    192216000
4   5-4-2014    192216001
4   6-4-2014    192216002
4   6-4-2014    192216003
4   6-4-2014    192216004
4   7-4-2014    180337000
4   7-4-2014    180337001
4   7-4-2014    180337002
5   1-12-2016   111830000
5   1-12-2016   111830001
5   1-12-2016   111830002
5   1-12-2016   111830003
5   1-12-2016   192131000
5   2-12-2016   192131001
5   3-12-2016   192131002
5   3-12-2016   111831000
5   3-12-2016   111831001
5   3-12-2016   111831002

I decided to create a for loop:

for (i in unique(data$Pseudo_ID)) data$Day[data$Pseudo_ID == i]  <- match(data$UpdateDateO, unique(data$UpdateDateO)) 
for (i in unique(data$Day)) data$Time[data$Day == i] <- match(data$UpdateTimeO, unique(data$UpdateTimeO))

First, it gives me a warning: number of items to replace is not a multiple of replacement length. Second, it generates the wrong output (incorrect values for similar and dissimilar days/times). Also, I want the day index to start at 1 for every new participant, and the time index to start at 1 at the first time point within each day. I know that I am not using the for loop properly, but at this point I am out of ideas.

This is the output:

Pseudo_ID UpdateDateO UpdateTimeO Day Time
 1    2-6-2012    95740000   1    1
 1    2-6-2012    95740001   1    2
 1    2-6-2012    95740002   1    3
 1    3-6-2012   182642000   2    1
 1    3-6-2012   182642001   2    2
 1    4-6-2012   182642002   3    1
 1    4-6-2012   182643000   3    2
 2   20-5-2012    34040000   1    4
 2   20-5-2012   101944000   1    5
 2   20-5-2012   101944001   1    6
 2   22-5-2012   101944002   2    3
 2   22-5-2012   190936000   2    4
 2   23-5-2012   190936001   3    3
 2   23-5-2012   190936002   3    4
 3   18-9-2013    20836000   1    7
 3   18-9-2013    20836001   1    8
 3   18-9-2013    20836002   1    9
 3   19-9-2013    20836003   2    5
 3   19-9-2013   132209000   2    6
 3   19-9-2013   132209001   3    5
 3   20-9-2013   114636000   3    6
 3   21-9-2013   114636001   4    1
 4    4-4-2014   101347000   1   10
 4    4-4-2014   101347001   1   11
 4    5-4-2014   101347002   1   12
 4    5-4-2014   101347003   2    7
 4    5-4-2014   101347004   2    8
 4    5-4-2014   192216000   3    7
 4    5-4-2014   192216001   3    8
 4    6-4-2014   192216002   4    2
 4    6-4-2014   192216003   4    3
 4    6-4-2014   192216004   4    4
 4    7-4-2014   180337000   5    1
 4    7-4-2014   180337001   5    2
 4    7-4-2014   180337002   6    1
 5   1-12-2016   111830000   1   13
 5   1-12-2016   111830001   1   14
 5   1-12-2016   111830002   1   15
 5   1-12-2016   111830003   2    9
 5   1-12-2016   192131000   2   10
 5   2-12-2016   192131001   3    9
 5   3-12-2016   192131002   3   10
 5   3-12-2016   111831000   4    5
 5   3-12-2016   111831001   4    6
 5   3-12-2016   111831002   4    7

Whereas I would like to see something like this:

Pseudo_ID   UpdateDateO UpdateTimeO Day Time
1   2-6-2012    95740000    1   1
1   2-6-2012    95740001    1   2
1   2-6-2012    95740002    1   3
1   2-6-2012    95740002    1   3
1   3-6-2012    182642000   2   1
1   3-6-2012    182642001   2   2
1   4-6-2012    182642002   3   1
1   4-6-2012    182643000   3   2
2   20-5-2012   34040000    1   1
2   20-5-2012   101944000   1   2
2   20-5-2012   101944001   1   3  
2   22-5-2012   101944002   2   1
2   22-5-2012   101944002   2   1
2   22-5-2012   101944002   2   1
2   22-5-2012   190936000   2   2
2   23-5-2012   190936001   3   1
2   23-5-2012   190936002   3   2
3   18-9-2013   20836000    1   1
3   18-9-2013   20836001    1   2
3   18-9-2013   20836001    1   2
3   18-9-2013   20836001    1   2
3   18-9-2013   20836002    1   3
3   19-9-2013   20836003    2   1  
3   19-9-2013   132209000   2   2
3   19-9-2013   132209000   2   2
3   19-9-2013   132209000   2   2
3   19-9-2013   132209001   2   3
3   20-9-2013   114636000   3   1
3   21-9-2013   114636001   4   1

Solution

  • A solution using dplyr, with the function rleid from data.table to make the grouping indices easier to create:

    df = read.table(text = "
    Pseudo_ID   UpdateDateO UpdateTimeO
    1   2-6-2012    95740000    
    1   2-6-2012    95740001   
    1   2-6-2012    95740002    
    1   2-6-2012    95740002    
    1   3-6-2012    182642000   
    1   3-6-2012    182642001   
    1   4-6-2012    182642002   
    1   4-6-2012    182643000   
    2   20-5-2012   34040000    
    2   20-5-2012   101944000   
    2   20-5-2012   101944001     
    2   22-5-2012   101944002   
    2   22-5-2012   101944002   
    2   22-5-2012   101944002   
    2   22-5-2012   190936000   
    2   23-5-2012   190936001   
    2   23-5-2012   190936002   
    ", header = TRUE)
    
    library(dplyr)
    library(data.table)
    
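    # Day:  run index of the date, restarting at 1 for each participant
    # Time: run index of the time, restarting at 1 for each participant and date
    df %>%
      group_by(Pseudo_ID) %>%
      mutate(Day = rleid(UpdateDateO)) %>%
      group_by(Pseudo_ID, UpdateDateO) %>%
      mutate(Time = rleid(UpdateTimeO)) %>%
      ungroup()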
    
    # # A tibble: 17 x 5
    #   Pseudo_ID UpdateDateO UpdateTimeO   Day  Time
    #       <int> <fct>             <int> <int> <int>
    # 1         1 2-6-2012       95740000     1     1
    # 2         1 2-6-2012       95740001     1     2
    # 3         1 2-6-2012       95740002     1     3
    # 4         1 2-6-2012       95740002     1     3
    # 5         1 3-6-2012      182642000     2     1
    # 6         1 3-6-2012      182642001     2     2
    # 7         1 4-6-2012      182642002     3     1
    # 8         1 4-6-2012      182643000     3     2
    # 9         2 20-5-2012      34040000     1     1
    #10         2 20-5-2012     101944000     1     2
    #11         2 20-5-2012     101944001     1     3
    #12         2 22-5-2012     101944002     2     1
    #13         2 22-5-2012     101944002     2     1
    #14         2 22-5-2012     101944002     2     1
    #15         2 22-5-2012     190936000     2     2
    #16         2 23-5-2012     190936001     3     1
    #17         2 23-5-2012     190936002     3     2
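
  • A base R alternative, offered as a minimal sketch rather than a definitive implementation. The warning in the question most likely arises because the right-hand side, match(data$UpdateDateO, unique(data$UpdateDateO)), is computed over all rows, while the left-hand side selects only the rows of one participant, so the lengths differ. The sketch below instead applies match() within each group through ave(). The toy data frame and the helper name first_seen_index are assumptions made for illustration; they are not from the original post.

```r
# Toy data in the same layout as the question (values are illustrative).
data <- data.frame(
  Pseudo_ID   = c(1, 1, 1, 1, 2, 2, 2),
  UpdateDateO = c("2-6-2012", "2-6-2012", "3-6-2012", "3-6-2012",
                  "20-5-2012", "22-5-2012", "22-5-2012"),
  UpdateTimeO = c(95740000, 95740001, 182642000, 182642000,
                  34040000, 101944002, 190936000)
)

# Hypothetical helper: index values by order of first appearance,
# so the first distinct value gets 1, the next distinct value 2, and so on.
first_seen_index <- function(x) match(x, unique(x))

# Day restarts at 1 for every participant. The dates are converted to
# integer codes first, because ave() returns a vector of the same type
# as its first argument (a character input would give "1", "2", ...).
data$Day <- ave(as.integer(factor(data$UpdateDateO)),
                data$Pseudo_ID,
                FUN = first_seen_index)

# Time restarts at 1 for every participant and date.
data$Time <- ave(data$UpdateTimeO,
                 data$Pseudo_ID, data$UpdateDateO,
                 FUN = first_seen_index)

print(data)
# Day:  1 1 2 2 1 2 2
# Time: 1 2 1 1 1 1 2
```

    Note that match(x, unique(x)) numbers distinct values, whereas rleid numbers consecutive runs. The two agree when the rows are sorted by ID, date, and time, as in the question; if a value reappears later within a group, rleid gives it a new index while match reuses the earlier one.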