Stata remove entire word from string


I have a string variable from which I want to remove certain words. Many other words in the variable contain those words as partial matches, and I don't want those touched: a word should be removed if and only if it is a complete match.

clear
* Add in some example data
input index str50 words
1 "more mor morph test"
2 "ten tennis tenner tenth keeper"
3 "badder baddy bad other"
end

* Keep a copy to compare before/after stripping
gen strip_words = words

* List of words to remove. In reality, this is a fairly long list
local removs "mor ten bad"
* For each word in the list, remove it from the string
foreach w of local removs {
    replace strip_words = subinstr(strip_words, "`w'", "", .)
}

list
     +---------------------------------------------------------------+
     | index                            words            strip_words |
     |---------------------------------------------------------------|
  1. |     1              more mor morph test            e ph test   |
  2. |     2   ten tennis tenner tenth keeper     nis ner th keeper  |
  3. |     3           badder baddy bad other         der dy other   |
     +---------------------------------------------------------------+

I've tried padding with spaces, e.g. replace strip_words = " " + strip_words + " ", but this also removes the spaces separating the other words. My desired output would be:

     +-------------------------------------------------------------------------+
     | index                            words                      strip_words |
     |-------------------------------------------------------------------------|
  1. |     1              more mor morph test              more  morph test    |
  2. |     2   ten tennis tenner tenth keeper    tennis tenner tenth keeper    |
  3. |     3           badder baddy bad other           badder baddy  other    |
     +-------------------------------------------------------------------------+

Solution

  • See help string functions for subinword(), which replaces only whole-word matches, unlike subinstr(), which replaces any matching substring.

    clear
    * Add in some example data
    input index str50 words
    1 "more mor morph test"
    2 "ten tennis tenner tenth keeper"
    3 "badder baddy bad other"
    end
    
    * Keep a copy to compare before/after stripping
    gen strip_words = words
    
    * List of words to remove. In reality, this is a fairly long list
    local removs "mor ten bad"
    * For each word in the list, remove whole-word matches from the string
    foreach w of local removs {
        replace strip_words = subinword(strip_words, "`w'", "", .)
    }
    
    * Collapse the runs of blanks left behind by the removed words
    replace strip_words = itrim(strip_words)
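
To see why subinword() behaves differently from subinstr(), the functions can be compared directly on strings from the example data. This is only a minimal sketch. The outer trim() is an addition that is not part of the answer above; it is assumed to be wanted because removing the first or last word leaves a single leading or trailing blank, which itrim() does not remove. The brackets are there only to make such blanks visible.

```stata
* subinstr() removes every occurrence of the substring, even inside longer words
display "[" + subinstr("more mor morph test", "mor", "", .) + "]"

* subinword() removes only whole, space-delimited words
display "[" + subinword("more mor morph test", "mor", "", .) + "]"

* Removing the first word leaves a leading blank; itrim() only collapses internal runs
display "[" + itrim(subinword("ten tennis tenner", "ten", "", .)) + "]"

* Assumption: trim() is added here to strip leading and trailing blanks as well
display "[" + trim(itrim(subinword("ten tennis tenner", "ten", "", .))) + "]"
```

In Stata 14 and later these two functions are documented as stritrim() and strtrim(); the older names are used here to match the answer. In the loop above, the final step would then read replace strip_words = trim(itrim(strip_words)).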