Search code examples
spssdata-management

SPSS Syntax - Identify duplicate responses and systematically identify cases to keep


I have a large set of survey data in SPSS where around 15% of respondents answered the survey more than once (this was not intended). I have formulated a systematic method to determine which cases to keep but am not sure how to write the loop to perform this task.

The variables I have are:

  • ID: unique identifier for every individual (some repeated submissions)
  • SurveyComplete: 0/1 (is the survey complete)
  • Duplicate: 0/1 (are they a person who submitted more than one survey)
  • PrimaryFirst: 0/1 (identifies first submission)
  • MatchSequence: integer (numerical indicator of which submission number the survey is)
  • Date: date of submission
  • keep: 0/1 (yet-to-be-created indicator of whether or not the record is being retained)

Here are what my data look like:

ID  SurveyComplete  Duplicate  PrimaryFirst  MatchSequence    Date   keep
123       1             1            1             1        07162015  .
123       1             1            0             2        07182015  .
456       0             1            1             1        07152015  .
456       1             1            0             2        07192015  .
789       0             1            1             1        07112015  .
789       0             1            0             2        07182015  .
789       0             1            0             3        07212015  .
012       1             0            1             1        07122015  .

Theoretically, I would like to determine the following in the order below:

  1. IF Primary = 1 AND SurveyComplete = 1 THEN keep = 1. Other submissions for this ID keep = 0.
  2. ELSE IF Primary = 0 AND SurveyComplete = 1 THEN keep = 1. Other submissions for this ID keep = 0.
  3. ELSE (where SurveyComplete = 0 for all responses) keep most recent submission.

And here is the resulting keep column:

ID  SurveyComplete  Duplicate  PrimaryFirst  MatchSequence    Date   keep
123       1             1            1             1        07162015  1
123       1             1            0             2        07182015  0
456       0             1            1             1        07152015  0
456       1             1            0             2        07192015  1
789       0             1            1             1        07112015  0
789       0             1            0             2        07182015  0
789       0             1            0             3        07212015  1
012       1             0            1             1        07122015  1

Ideally I would like to be able to complete this in SPSS syntax without plugins as my work doesn't take kindly to software add-ons. Any help that can be provided is much appreciated!


Solution

  • After every step an AGGREGATE function determines for every ID if a decision was already made. An ID that already has a decision will be taken out of the game, undecided IDs go on to the next step:

    * creating fake data to play around with.
    * note I added an extra line for ID=456 to demonstrate choice between multiple non-primary lines.
    
    DATA LIST list (", ") / ID SurveyComplete Duplicate PrimaryFirst MatchSequence Date.
            begin data
            123, 1, 1, 1, 1, 7162015
            123, 1, 1, 0, 2, 7182015
            456, 0, 1, 1, 1, 7152015
            456, 1, 1, 0, 2, 7192015
            456, 1, 1, 0, 3, 7192015
            789, 0, 1, 1, 1, 7112015
            789, 0, 1, 0, 2, 7182015
            789, 0, 1, 0, 3, 7212015
            12, 1, 0, 1, 1, 7122015
            end data.
            execute.
    
    * now starting work on defining the KEEP variable.
    
            if (PrimaryFirst = 1 AND SurveyComplete = 1) keep=1.
            if (PrimaryFirst = 0 AND SurveyComplete = 1) NotPrimarySeq=MatchSequence.
            aggregate /outfile=* mode=addvariables /break=ID /decided=max(keep)/NotPrimarySeq_min=min(NotPrimarySeq).
    
            if missing(decided) and (PrimaryFirst = 0 AND SurveyComplete = 1) keep=(NotPrimarySeq=NotPrimarySeq_min).
            aggregate/outfile=* mode=addvariables overwritevars=yes /break=ID/decided=max(keep)/Date_max=MAX(Date).
    
            if missing(decided) keep=(date=date_max).
            recode keep (miss=0).
            execute.