
How to delete text in data where there should be numbers


I have the following numeric variable in Stata:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long r_3srhlt
3
3
2
2
4
1
1
3
3
4
end
label values r_3srhlt r_3srhlt
label def r_3srhlt 1 ".", modify
label def r_3srhlt 2 "2.very ...", modify
label def r_3srhlt 3 "3.good", modify
label def r_3srhlt 4 "5.poor", modify

I would like to keep just the number and not the text.

For example, I want 3, 3, 2, 2, 5, ., ., 3, 3, 5, without the "good", "very good", "poor", etc. My data was originally a Stata file that I read into R via haven. After doing some manipulations on the file, I imported it back into Stata.

How can I accomplish this?


Solution

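  • You have a numeric variable with value labels attached, and the numbers you want are in the label text, not in the stored values (the stored value 4, for instance, is labelled "5.poor"). So you first need to decode the labels into a string variable: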

    decode r_3srhlt, generate(r_3srhlt_string)
    

    Then you can get all numbers in one go using the real() function and a simple regular expression:

    generate wanted = real(ustrregexs(0)) if ustrregexm(r_3srhlt_string, "[0-9]*")
    
    list, separator(0) abbreviate(15)
    
         +---------------------------------------+
         |   r_3srhlt   r_3srhlt_string   wanted |
         |---------------------------------------|
      1. |     3.good            3.good        3 |
      2. |     3.good            3.good        3 |
      3. | 2.very ...        2.very ...        2 |
      4. | 2.very ...        2.very ...        2 |
      5. |     5.poor            5.poor        5 |
      6. |          .                 .        . |
      7. |          .                 .        . |
      8. |     3.good            3.good        3 |
      9. |     3.good            3.good        3 |
     10. |     5.poor            5.poor        5 |
         +---------------------------------------+
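
The steps above can also be run end to end as one self-contained sketch. It rebuilds the example data from the question, decodes the labels, and extracts the leading digits. Two things in it are my assumptions rather than part of the answer: the pattern "^[0-9]+" (at least one digit, anchored at the start of the label) is used in place of "[0-9]*" to make the intent explicit, and the final cleanup assumes you want the plain numeric variable to replace the labelled one under the original name. On this data the stricter pattern should give the same values as the original.

```stata
* Rebuild the example data from the question
clear
input long r_3srhlt
3
3
2
2
4
1
1
3
3
4
end
label define r_3srhlt 1 "." 2 "2.very ..." 3 "3.good" 4 "5.poor"
label values r_3srhlt r_3srhlt

* Turn the value labels into a string variable
decode r_3srhlt, generate(r_3srhlt_string)

* "^[0-9]+" requires at least one digit at the start of the label;
* labels without a leading digit (here ".") are left as missing
generate wanted = real(ustrregexs(0)) if ustrregexm(r_3srhlt_string, "^[0-9]+")

* Optional cleanup (an assumption about what you want to keep):
* replace the labelled variable with the plain numeric one
drop r_3srhlt r_3srhlt_string
label drop r_3srhlt
rename wanted r_3srhlt

list, separator(0)
```

If you would rather keep the labelled variable for reference, skip the cleanup lines and work with wanted directly.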