Search code examples
databaseerror-handlingsasdetect

SAS detect and correct typo errors in data base


I have a data base with 50 obs(employees) and multiple vars: questions q1,q2....,q10 => q1-q10 with satisfaction scale of 1-5 gender with scale of 1 and 0 status with scale of 1=married 0=single income weight education years and so on..

it is known that the data has typo errors.I need to run over the data base and detect: for which vars there are typo errors which (obs)employees has errors

how do I define the conditions? for example: an error might be double digit (gender=00 instead of 0), or value exceeds scale (q2=8). should I define the error for each var separately? for the obvious ones its easy, but generally, to detect an error in var=education years, do I do "where weight=<0" since its just common sense?

proc print data=comb;
where inc<0;
where gender ne 0&1;
where married ne 0&1;
where q1-q10 ne 1-5;
where w=<0;
where h=<0;
where edc<0;

After detecting the errors I need to correct them: if both digits are similar (for example gender=00). I should present only one of them => (gender=0). how can I print only the first digit then (for this specific correction?) if value exceeds scale turn to missing. again, do I do that for each var separately?

data comb;
if gender ne 0 & 1 then gender=
else if  married ne 0&1 then married=
else if q1-q10 ne 5-10 then q1-q10='';
else if 
run;

either way I am not sure how to build these conditions correctly.


Solution

  • There's no quick way to identifying/correcting the errors other than defining the criteria as you have been doing. However, it seems you know your criteria of what needs correcting and how to correct it. The less passes SAS needs to make through the better which is good in this case as you can identify and correct them all in the same pass through the data.

    I'm not sure if you need to identify which records have had errors corrected or not but I have included do loops below to additionally set cleanflg=1 - alternatively you could swap this for putting a message to the log file (such as putlog "NO" "TE: Variable Gender has been corrected from the original value of " gender= ; (for example):

    data clean ;
      set dirty ;
      array Q[10] q1-q10 .;
    
      if inc<0 then cleanflg=1 ; *Although assume it may be better to delete;
    
      if gender not in('0','1','00') the do ;
        cleanflg=1 ;
        gender=.;
      end ;
    
      if married not in(0,1) then do ;
        cleanflg=1 ;
        married=.;
      end ;
    
      *Loop through Question array to set to missing if outside required range ;
      do i=1 to 10 ;
        if Q[i] > 5 or Q[i]<1 then do ;
          cleanflg=1 ;
          Q[i]=. ;
        end ;
      end ;
    run ;