Search code examples
matlabstata

Clean dirty data


I have three variables (ID, Name and City) and need to generate a new variable flag.

There are something wrong with the observations. I need to find the wrong observations and create the flag. The variable flag indicates which column contains the wrong observation.

Suppose just one bad observation at most in each row.

Given dirty data!!!!!

|ID      |Name             |City 
|1       |IBM              |D    
|1       |IBM              |D    
|2       |IBM              |D    
|3       |Google           |F    
|3       |Microsoft        |F    
|3       |Google           |F    
|8       |Microsoft        |A    
|8       |Microsoft        |B    
|8       |Microsoft        |A    

Result

|ID      |Name             |City |flag
|1       |IBM              |D    |0
|1       |IBM              |D    |0
|2       |IBM              |D    |1
|3       |Google           |F    |0
|3       |Microsoft        |F    |2
|3       |Google           |F    |0
|8       |Microsoft        |A    |0
|8       |Microsoft        |B    |3
|8       |Microsoft        |A    |0

Solution

  • This code is based on Eric's answer.

    clear all
    input float ID str9 Name str1 City
    1 "IBM"       "D"
    1 "IBM"       "D"
    2 "IBM"       "D"
    3 "Google"    "F"
    3 "Microsoft" "F"
    3 "Google"    "F"
    8 "Microsoft" "A"
    8 "Microsoft" "B"
    8 "Microsoft" "A"
    end
    
    encode Name, gen(Name_n)
    encode City, gen(City_n)
    
    // get dummy variable for 
    duplicates tag ID Name, gen(col_12)
    duplicates tag ID City, gen(col_13)
    duplicates tag Name City, gen(col_23)
    duplicates tag ID Name City, gen(col_123)
    
    // generate the flag
    gen flag = 0
    replace flag = 1 if col_123 == 0 & col_23 ~= 0
    replace flag = 2 if col_123 == 0 & col_13 ~= 0
    replace flag = 3 if col_123 == 0 & col_12 ~= 0
    
    drop Name_n City_n col_*