Search code examples
rpanel-data

How to add dummy variables to data with specific characteristic


My question is probably quite basic but I've been struggling with it so I'd be really grateful if someone could offer a solution.

I have data in the following format:

ORG_NAME var_1_12 var_1_13 var_1_14
A 12 11 5
B 13 13 11
C 6 7 NA
D NA NA 5

I have data on organizations over 5 years, but over that time, some organizations have merged and others have disappeared. I'm planning on conducting a fixed-effects regression, so I need to add a dummy variable which is "0" when organizations have remained the same (in this case row A and row B), and "1" in the year before the merge, and after the merge. In this case, I know that orgs C and D merged, so I would like for the data to look like this:

ORG_NAME var_1_12 dum_12 var_1_13 dum_13
A 12 0 5 0
B 13 0 11 0
C 6 1 NA 1
D NA 1 5 1

How would I code this?


Solution

  • This approach (as is any, according to your description) is absolutely dependent on the companies being in consecutive rows.

    mtx <- apply(is.na(dat[,-1]), MARGIN = 2,
                 function(vec) zoo::rollapply(vec, 2, function(z) xor(z[1], z[2]), fill = FALSE))
    mtx
    #      var_1_12 var_1_13 var_1_14
    # [1,]    FALSE    FALSE    FALSE
    # [2,]    FALSE    FALSE     TRUE
    # [3,]     TRUE     TRUE     TRUE
    # [4,]    FALSE    FALSE    FALSE
    
    out <- rowSums(mtx) == ncol(mtx)
    out
    # [1] FALSE FALSE  TRUE FALSE
    out | c(FALSE, out[-length(out)])
    # [1] FALSE FALSE  TRUE  TRUE
    
    ### and your 0/1 numbers, if logical isn't right for you
    +(out | c(FALSE, out[-length(out)]))
    # [1] 0 0 1 1
    

    Brief walk-through:

    • is.na(dat[,-1]) returns a matrix of whether the values (except the first column) are NA; because it's a matrix, we use apply to call a function on each column (using MARGIN=2);

    • zoo::rollapply is a function that does rolling calculations on a portion ("window") of the vector at a time, in this case 2-wide. For example, if we have 1:5, then it first looks at c(1,2), then c(2,3), then c(3,4), etc.

    • xor is an eXclusive OR, meaning it will be true when one of its arguments are true and the other is false;

    • mtx is a matrix indicating that a cell and the one below it met the conditions (one is NA, the other is not). We then check to see which of these rows are all true, forming out.

    • since we need a 1 in both rows, we vector-AND & out with itself, shifted, to produce your intended output