Search code examples
rdummy-variable

Dummy variable conditioned on repetitions in grouped observations


EDIT

Thank you for your replies. However, I still haven't managed to work out my problem, as my dataset contains 700,000 obeservations, and all the approaches below results in error, or simply continues to run for hours without finishing (I can tell that Rstudio R Session is running and consuming a lot of my RAM, but it simply isn't going anywhere).

As you can imagine, splitting up the dataset in smaller pieces is not an option, since it would defeat the purpose of the exercise: I need to look through every single previous observation to get out the desired result.

Any ideas? I will leave the question unanswered for now, but if you guys think I should post a new question, I will (I honestly don't know the etiquette regarding these things, so be free to leave a suggestion).


Original post

As the title suggest, I am looking for a dummy variable, which is conditioned on repetitions in grouped observations.

Consider the following dataframe:

   id name year
1   c   af 2000
2   c   el 2000
3   c   in 2000
4   c   ud 2000
5   d   ot 2000
6   d   an 2000
7   d   el 2000
8   d   un 2000
9   f   yt 2002
10  f   ip 2002
11  f   ot 2002
12  f   el 2002
13  g   yt 2003
14  g   af 2003
15  g   ol 2003
16  g   in 2003
17  h   in 2003
18  h   eg 2003
19  h   yt 2003
20  h   af 2003
21  j   ot 2004
22  j   el 2004
23  j   ip 2004
24  j   yt 2004

I am looking for a function, that will allow me to group the data by id, and return the value "1" if an id contains at least three names have occurred in a previous id. By previous id, I mean that the year of the previous id has to be less than for the current id.

The desired output should look like this:

   id name year dummy
1   c   af 2000     0
2   c   el 2000     0
3   c   in 2000     0
4   c   ud 2000     0
5   d   ot 2000     0
6   d   an 2000     0
7   d   el 2000     0
8   d   un 2000     0
9   f   yt 2002     0
10  f   ip 2002     0
11  f   ot 2002     0
12  f   el 2002     0
13  g   yt 2003     0
14  g   af 2003     0
15  g   ol 2003     0
16  g   in 2003     0
17  h   in 2003     0
18  h   eg 2003     0
19  h   yt 2003     0
20  h   af 2003     0
21  j   ot 2004     1
22  j   el 2004     1
23  j   ip 2004     1
24  j   yt 2004     1

id = "j" takes on the value dummy = "1", as at least three names, "yt", "ip" and "ot", occurs in id = "f". In this case, there was also a fourth name that also occurred, "el", but this does not affect the result.

Notice that id = "h" takes on the value dummy = "0", even though three names also occurred in id = "g". This is because both occurrences happened in 2003, and so it does not fulfill the condition of separate years.

Data:

DF = structure(list(id = c("c", "c", "c", "c", "d", "d", "d", "d", 
"f", "f", "f", "f", "g", "g", "g", "g", "h", "h", "h", "h", "j", 
"j", "j", "j"), name = c("af", "el", "in", "ud", "ot", "an", 
"el", "un", "yt", "ip", "ot", "el", "yt", "af", "ol", "in", "in", 
"eg", "yt", "af", "ot", "el", "ip", "yt"), year = c(2000L, 2000L, 
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2002L, 2002L, 2002L, 
2002L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 
2004L, 2004L, 2004L, 2004L), dummy = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 
1L, 1L)), .Names = c("id", "name", "year", "dummy"), row.names = c(NA, 
-24L), class = "data.frame")

Solution

  • After OP's edit about speed and memory issues, how about a Rcpp approach:

    #create a integer column out of id for non-equi join later
    setDT(DF)[, nid := rleid(id)]
    
    #convert name into an integer code
    DF[DF[,.(name=unique(name))][, IntCode := .I], iname := IntCode, on=.(name)]
    
    library(inline)
    library(Rcpp)
    cppFunction('
    NumericVector hasOccur(NumericVector nid, NumericVector year, List iname) {
        List namelist(iname);
        int sz = namelist.size(), i, j, m, n, nPrev, nCurr, count;
        NumericVector res(sz);
    
        for(i=0; i<sz; i++) {
            for(j=0; j<i; j++) {
                if (nid[j] < nid[i] && year[j] < year[i]) {
                    SEXP prevList = namelist[j];
                    SEXP currList = namelist[i];
    
                    NumericVector cl(currList);
                    NumericVector pl(prevList);
                    nPrev = pl.size();
                    nCurr = cl.size();
    
                    res[i] = 0;
                    count = 0;
                    for(m=0; m<nCurr; m++) {
                        for (n=0; n<nPrev; n++) {
                            if (cl[m] == pl[n]) {
                                count++;
                                break;
                            }
                        }
                    }
    
                    if (count >= 3) {
                        res[i] = 1;
                        break;
                    }
                }
            }
        }
    
        return(res);
    }')
    
    d <- DF[, .(.(nm=iname)), by=.(nid, year)]
    DF[d[, dummy := hasOccur(d$nid, d$year, d$V1)], dummy := dummy, on=.(nid, year)]
    

    HTH.


    Another possible data.table approach:

    #create a integer column out of id for non-equi join later
    setDT(DF)[, nid := rleid(id)]
    
              #self non-equi join
    check3 <- DF[DF, .(x.id, x.name, x.year, x.nid, i.id, i.name, i.year, i.nid), on=.(nid<nid, year<year, name=name)][,
        #count the number of occurrence in previous id and year
        uniqueN(x.name, na.rm=TRUE), by=.(i.id, i.year, x.id, x.year)][,
            #check if more than 3
            any(V1 >= 3L), by=.(i.id, i.year)]
    
    #update join to add result to original DF
    DF[check3, dummy := as.integer(V1), on=c("id"="i.id", "year"="i.year")]