
Removing non-numbers from a string in SPSS


Consider the following data:

[Image: sample data, numbers mixed with text]

As you can see, the values of the variable are inherently numeric, but some of them include text. I have tried every permutation of do repeat...end repeat I could think of to remove the non-numeric characters and leave just the numbers, without success.

Is there some syntax that will do it? Is there a function that checks whether a substr contains any of a set of characters? Then I could create a set that represents all the digits, loop through each character in the string, and if it is not in the set, replace it with a null.


Solution

  • This IBM support page answers a somewhat similar question: https://www.ibm.com/support/pages/removing-unwanted-characters-strings

    You will have a lot more characters to search for (the whole of a-z and A-Z, and probably some non-letter characters as well), but it should work. You might also want to use the newer CHAR.INDEX and REPLACE functions, if your version of SPSS provides them (the linked documentation is for version 23); see the official IBM SPSS documentation on them: https://www.ibm.com/support/knowledgecenter/en/SSLVMB_23.0.0/spss/base/syn_transformation_expressions_string_functions.html

    Later edit (after clarifications and suggestions from the OP):

    You need to adjust two things in the IBM examples:

    1. Hardcode the loop to exit after k iterations (not when #I=0, which would stop at the first character it does not find). In the example below, k runs from 1 to the length of the string.

    2. Specify all the characters you want to remove: a to z, the space, the single quotation mark (written as two consecutive single quotation marks inside the quoted string), and anything else you think you might need to clean. Then this should work:

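      * Lowercase first so that only lowercase letters need to be listed.
      COMPUTE x = LOWER(x).
      * One pass per character, and each pass removes at most one unwanted character.
      LOOP k = 1 TO CHAR.LENGTH(x).
        * Position of the first character of x that appears in the list (0 if none).
        COMPUTE #I = CHAR.INDEX(x, 'abcdefghijklmnopqrstuvwxyz+, ''', 1).
        IF #I > 0 x = CONCAT(CHAR.SUBSTR(x, 1, #I - 1), CHAR.SUBSTR(x, #I + 1)).
      END LOOP.
      EXECUTE.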
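
    To check the logic outside SPSS, here is a minimal Python sketch of the same idea. It is only an illustration, not SPSS syntax: the function names strip_unwanted and keep_digits and the sample values are hypothetical. The first function mirrors the loop above (one unwanted character removed per pass, with the number of passes fixed up front), and the second follows the approach suggested in the question (keep only characters found in a set of digits).

```python
UNWANTED = "abcdefghijklmnopqrstuvwxyz+, '"


def strip_unwanted(value: str, unwanted: str = UNWANTED) -> str:
    """Mirror the SPSS loop: remove one unwanted character per pass."""
    value = value.lower()
    # The number of passes is fixed before the loop starts,
    # like LOOP k = 1 TO CHAR.LENGTH(x).
    for _ in range(len(value)):
        # Index of the first character that is in the unwanted list, or -1.
        first = next((i for i, ch in enumerate(value) if ch in unwanted), -1)
        if first < 0:
            break  # nothing left to remove; the SPSS loop would just idle
        # Same as CONCAT(CHAR.SUBSTR(x, 1, #I - 1), CHAR.SUBSTR(x, #I + 1)).
        value = value[:first] + value[first + 1:]
    return value


def keep_digits(value: str) -> str:
    """The question's idea: keep only characters found in a set of digits."""
    digits = set("0123456789")
    return "".join(ch for ch in value if ch in digits)


if __name__ == "__main__":
    for sample in ["12 Years", "approx. 45+"]:
        print(repr(sample), repr(strip_unwanted(sample)), repr(keep_digits(sample)))
```

    Note that the removal-list approach leaves any character you did not list (the period in the second sample, for instance), whereas keeping only digits drops everything else, including decimal points and minus signs you may want to preserve.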