I have three variables (ID, Name and City) and need to generate
a new variable flag.
There are something wrong with the observations. I need to find the wrong observations and create the flag. The variable flag indicates which column contains the wrong observation.
Suppose just one bad observation at most in each row.
Given dirty data!!!!!
|ID |Name |City
|1 |IBM |D
|1 |IBM |D
|2 |IBM |D
|3 |Google |F
|3 |Microsoft |F
|3 |Google |F
|8 |Microsoft |A
|8 |Microsoft |B
|8 |Microsoft |A
Result
|ID |Name |City |flag
|1 |IBM |D |0
|1 |IBM |D |0
|2 |IBM |D |1
|3 |Google |F |0
|3 |Microsoft |F |2
|3 |Google |F |0
|8 |Microsoft |A |0
|8 |Microsoft |B |3
|8 |Microsoft |A |0
This code is based on Eric's answer.
clear all
input float ID str9 Name str1 City
1 "IBM" "D"
1 "IBM" "D"
2 "IBM" "D"
3 "Google" "F"
3 "Microsoft" "F"
3 "Google" "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end
encode Name, gen(Name_n)
encode City, gen(City_n)
// get dummy variable for
duplicates tag ID Name, gen(col_12)
duplicates tag ID City, gen(col_13)
duplicates tag Name City, gen(col_23)
duplicates tag ID Name City, gen(col_123)
// generate the flag
gen flag = 0
replace flag = 1 if col_123 == 0 & col_23 ~= 0
replace flag = 2 if col_123 == 0 & col_13 ~= 0
replace flag = 3 if col_123 == 0 & col_12 ~= 0
drop Name_n City_n col_*