Search code examples
spss

SPSS - How to identify First and Last available measurements in a row of 120 variables


I have checked a previous post about it (SPSS Last available measurement in a row of variables), but I still have doubts on how to perform this task more efficiently. I have a dataset with 98 million rows and 120 variables (one for each month, from January/2005 to December/2014). I need to identify the first and the last valid (non-missing) measurements in a row of variables, for each observation in the dataset. The dataset looks like:

v1 v2 ... v120
1 2 ... 5
. 2 ... 5
3 1 ... .

I have tried the version using loops (also suggested on this topic: SPSS Last available measurement in a row of variables). I used the syntaxes below, but they didn’t work. I keep getting error messages… It is very likely that, since I did not understand all the steps in it, I misused it.

DEFINE LAST_VALID ()    
!DO !@ = 1 !TO 120 .    
!LET !a = !CONCAT("v", !@) .    
COMPUTE LAST_VALID = !a .    
!DOEND .    
!ENDDEFINE.    
LAST_VALID .     
EXECUTE.    

ERROR MESSAGES (some of them):

Error # 4382 in column 1024. Text: (End of Command) >An equals sign was >not found when expected after a target variable in a >COMPUTE command. >Execution of this command stops.

Warning # 231 >The depth of macro nesting has reached the current limit. >To increase the >limit, use SET MNEST. To check the limit use the SHOW >command.

Warning # 210 in column 9. Text: !ERROR_MACRO >A macro symbol is invalid >in this context. >The symbol will be treated as an invalid special >character.

Error # 4285 in column 9. Text: !ERROR_MACRO >Incorrect variable name: >either the name is more than 64 characters, or it is >not defined by a >previous command. >Execution of this command stops.

DEFINE FIRST_VALID ()    
!DO !@ = 1 !TO 120 .    
!LET !a = !CONCAT("v", !@) .    
LOOP IF MISSING (FIRST_VALID) = 1.    
COMPUTE FIRST_VALID = !a .    
END LOOP IF FIRST_VALID > 0.    
!DOEND .    
!ENDDEFINE.    
FIRST_VALID.    
EXECUTE.    

ERROR MESSAGES (some of them):

Warning # 231 >The depth of macro nesting has reached the current limit. >To increase the >limit, use SET MNEST. To check the limit use the SHOW >command.

Warning # 210 in column 18. Text: !ERROR_MACRO >A macro symbol is invalid >in this context. >The symbol will be treated as an invalid special >character.

Error # 4007 in column 18. Text: !ERROR_MACRO >The expression is >incomplete. Check for missing operands, invalid operators, >unmatched >parentheses or excessive string length. >Execution of this command stops.

Error # 4846 in column 18. Text: !ERROR_MACRO >The LOOP command contains >unrecognized text after the end of the IF clause.

Warning # 210 in column 9. Text: !ERROR_MACRO >A macro symbol is invalid >in this context. >The symbol will be treated as an invalid special >character.

Error # 4285 in column 9. Text: !ERROR_MACRO >Incorrect variable name: >either the name is more than 64 characters, or it is >not defined by a >previous command. >Execution of this command stops.

Error # 4014 in column 13. Text: !ERROR_MACRO >SPSS Statistics was >expecting an expression but encountered the end of the >command. Check the >expression for omitted or extra operands, operators, and >parentheses. >Execution of this command stops.

Error # 4045. Command name: END LOOP >The END LOOP command does not follow >an unclosed LOOP command. Maybe the LOOP >command was not recognized >because of an error. Use the level-of-control >shown to the left of the >SPSS Statistics commands to determine the range of >LOOPs and DO IFs.*

I am probably forgetting to adjust my syntax for something, but I cannot figure out what. I apologize if it is too obvious…


Solution

  • It is not clear from your description nor your code snippets what you want. Either the last actual value or the index location of that value. E.g. if the last valid value is in V110 and equals 5 for record 1 million, do you want to know "110" or do you want to know "5".

    Here is a simple approach using DO REPEAT that will return either. LastVal would be the "5" in my example, and LastId would be the 110.

    DO REPEAT V = V1 TO V120 /#i = 1 TO 120.
      DO IF NOT MISSING(V).
        COMPUTE LastVal = V.
        COMPUTE LastId = #i.
      END IF.
    END REPEAT.
    

    To also return the first index and value you can do a second DO IF within the DO REPEAT.

    NUMERIC FirstVal FirstId.
    DO REPEAT V = V1 TO V120 /#i = 1 TO 120.
      DO IF NOT MISSING(V).
        COMPUTE LastVal = V.
        COMPUTE LastId = #i.
        DO IF MISSING(FirstVal).
          COMPUTE FirstVal = V.
          COMPUTE FirstId = #i.
        END IF.
      END IF.
    END REPEAT.
    

    With 98 million records, this will probably take awhile - especially if not on a server. You may experiment by using VARSTOCASES, which by default drops missing data. But these will chug along until it is done.


    You can do some improvements in efficiency by using LOOP to the same effect, but by breaking out when the first valid value is encountered. So for the first valid value you could do.

    VECTOR V = V1 TO V120.
    LOOP #i = 1 TO 120.
    END LOOP IF NOT MISSING(V(#i)).
    COMPUTE FirstVal = V(#i).
    COMPUTE FirstId = #i.
    

    For the last values you just need to reverse the loop.

    VECTOR V = V1 TO V120.
    LOOP #i = 120 TO 1 BY -1.
    END LOOP IF NOT MISSING(V(#i)).
    COMPUTE LastVal = V(#i).
    COMPUTE LastId = #i.