My question is probably quite basic but I've been struggling with it so I'd be really grateful if someone could offer a solution.
I have data in the following format:
ORG_NAME | var_1_12 | var_1_13 | var_1_14 |
---|---|---|---|
A | 12 | 11 | 5 |
B | 13 | 13 | 11 |
C | 6 | 7 | NA |
D | NA | NA | 5 |
I have data on organizations over 5 years, but over that time, some organizations have merged and others have disappeared. I'm planning on conducting a fixed-effects regression, so I need to add a dummy variable which is "0" when organizations have remained the same (in this case row A and row B), and "1" in the year before the merge, and after the merge. In this case, I know that orgs C and D merged, so I would like for the data to look like this:
ORG_NAME | var_1_12 | dum_12 | var_1_13 | dum_13 |
---|---|---|---|---|
A | 12 | 0 | 5 | 0 |
B | 13 | 0 | 11 | 0 |
C | 6 | 1 | NA | 1 |
D | NA | 1 | 5 | 1 |
How would I code this?
This approach (as is any, according to your description) is absolutely dependent on the companies being in consecutive rows.
mtx <- apply(is.na(dat[,-1]), MARGIN = 2,
function(vec) zoo::rollapply(vec, 2, function(z) xor(z[1], z[2]), fill = FALSE))
mtx
# var_1_12 var_1_13 var_1_14
# [1,] FALSE FALSE FALSE
# [2,] FALSE FALSE TRUE
# [3,] TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE
out <- rowSums(mtx) == ncol(mtx)
out
# [1] FALSE FALSE TRUE FALSE
out | c(FALSE, out[-length(out)])
# [1] FALSE FALSE TRUE TRUE
### and your 0/1 numbers, if logical isn't right for you
+(out | c(FALSE, out[-length(out)]))
# [1] 0 0 1 1
Brief walk-through:
is.na(dat[,-1])
returns a matrix of whether the values (except the first column) are NA
; because it's a matrix, we use apply
to call a function on each column (using MARGIN=2
);
zoo::rollapply
is a function that does rolling calculations on a portion ("window") of the vector at a time, in this case 2-wide. For example, if we have 1:5
, then it first looks at c(1,2)
, then c(2,3)
, then c(3,4)
, etc.
xor
is an eXclusive OR, meaning it will be true when one of its arguments are true and the other is false;
mtx
is a matrix indicating that a cell and the one below it met the conditions (one is NA
, the other is not). We then check to see which of these rows are all true, forming out
.
since we need a 1
in both rows, we vector-AND &
out with itself, shifted, to produce your intended output