How to remove everything but certain words in string variable (Stata)?


I have a string variable response, which contains text as well as categories that have already been coded (categories like "CatPlease", "CatThanks", "ExcuseMe", "Apology", "Mit", etc.). I would like to erase everything in response except for these previously coded categories.

For example, I would like response to change from:

"I Mit understand CatPlease read it again CatThanks"

to:

"Mit CatPlease CatThanks"

This seems like a simple problem, but I can't get my regex code to work perfectly. The code below attempts to store the categories in a variable cat_only. It only works if the category appears at the beginning of response. The local macro, cats, contains all of the words I would like to preserve in response:

    local cats = "(CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)?"

    gen cat_only = strltrim(strtrim(ustrregexs(1)+" "+ustrregexs(2)+" "+ustrregexs(3))) if ustrregexm(response, "`cats'.+?`cats'.+?`cats'")

If I add characters to the beginning of the search pattern in ustrregexm, however, nothing will be stored in cat_only:

    gen cat_only = strltrim(strtrim(ustrregexs(1)+" "+ustrregexs(2)+" "+ustrregexs(3))) if ustrregexm(response, ".+?`cats'.+?`cats'.+?`cats'")

Is there a way to fix my code to make it work, or should I approach the problem differently?


Solution

    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str50 response
    "I Mit understand CatPlease read it again CatThanks"
    end
    
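    * Match each word that does not begin with one of the category names
    local regex "(?!CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)\b[^\s]+\b"
    * Delete the matched words, then collapse repeated blanks and trim both ends
    gen wanted = strtrim(stritrim(ustrregexra(response, "`regex'", "")))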
    list
    
    . list
    
         +-------------------------------------------------------------------------------+
         |                                           response                     wanted |
         |-------------------------------------------------------------------------------|
      1. | I Mit understand CatPlease read it again CatThanks    Mit CatPlease CatThanks |
         +-------------------------------------------------------------------------------+
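
One caveat about the pattern above: the negative lookahead only checks that a word begins with a category name, so a word such as "Mitten" would be kept because it starts with "Mit". Below is a minimal sketch of a stricter whole-word variant, assuming that behaviour is not wanted. The second observation, the variable name wanted_strict, and the local names cats and strict are made up for illustration and are not part of the original answer.

```stata
* Sketch only: whole-word variant of the answer's pattern
clear
input str60 response
"I Mit understand CatPlease read it again CatThanks"
"Mitten DKNY Apology accepted"
end

* Category names copied from the question (no surrounding parentheses here)
local cats "CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG"

* The \b inside the lookahead means a word is protected only when the
* category name is the entire word, not merely a prefix of it
local strict "\b(?!(?:`cats')\b)[^\s]+\b"

* Delete every unprotected word, then collapse repeated blanks and trim both ends
gen wanted_strict = strtrim(stritrim(ustrregexra(response, "`strict'", "")))

* Checks for the two illustrative observations
assert wanted_strict == "Mit CatPlease CatThanks" in 1
assert wanted_strict == "Apology" in 2
list
```

Both patterns delete up to the last word boundary of each word, so punctuation attached to a removed word (for example the comma in "again,") is likely to be left behind. If the responses contain punctuation, it may need a separate cleanup step.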