I have a dataset that looks like this
ID S1 S2 S3 year
1 1 2 1999
1 1 2 2000
2 1 1999
2 1 2000
3 1 1999
3 2 2000
4 1 2 3 1999
4 3 1 2 2000
5 1 2 1999
5 1 3 2000
where the first column is the subject ID, S1
to S3
are location identifiers and year refers to either 1999 or 2000. I would like to count all the occurrences in which a subject stays in the same location in both years (using Stata). That is, for subject 1, this value is 2, for subject 2, it is 1, for subject 3 it is 0, for subject 4 it is 3 and for subject 5 it is 1.
I assume your blanks are really numeric missing. This should also work if S1 S2 S3
are in fact string and the blanks are empty strings. The example that 1 2 3 and 3 1 2 are equivalent rules out various simpler solutions of just counting similarities over S1 S2 S3
in a loop.
clear
input ID S1 S2 S3 year
1 1 2 . 1999
1 1 2 . 2000
2 1 . . 1999
2 1 . . 2000
3 1 . . 1999
3 2 . . 2000
4 1 2 3 1999
4 3 1 2 2000
5 1 2 . 1999
5 1 3 . 2000
end
reshape long S, i(ID year) j(which)
bysort ID (S) : gen count = sum(S == S[_n-1] & !missing(S))
by ID: replace count = count[_N]
reshape wide S, i(ID year) j(which)
list, sepby(ID)
+----------------------------------+
| ID year S1 S2 S3 count |
|----------------------------------|
1. | 1 1999 1 2 . 2 |
2. | 1 2000 1 2 . 2 |
|----------------------------------|
3. | 2 1999 1 . . 1 |
4. | 2 2000 1 . . 1 |
|----------------------------------|
5. | 3 1999 1 . . 0 |
6. | 3 2000 2 . . 0 |
|----------------------------------|
7. | 4 1999 1 2 3 3 |
8. | 4 2000 3 1 2 3 |
|----------------------------------|
9. | 5 1999 1 2 . 1 |
10. | 5 2000 1 3 . 1 |
+----------------------------------+