
Stata flag when word found, not strpos


I have some data with strings, and I want to flag when a word is found. A word is defined as a substring at the start of the string, at the end, or separated by spaces. strpos will find the substring wherever it occurs, but I am looking for something similar to subinword. Does Stata have a way to use the functionality of subinword to flag the word instead of replacing it?

clear 
input id str50 strings
1 "the thin th man"
2  "this old then"
3 "th to moon"
4 "moon blank th"
end

gen th_pos = 0
replace th_pos = 1 if strpos(strings, "th") > 0

The code above flags every observation, because they all contain "th", but my desired output is:

ID      strings          th_sub
1   "the thin th man"      1
2   "this old then"        0
3   "th to moon"           1
4   "moon blank th"        1

Solution

  • A small trick is that "th" as a word will be preceded and followed by a space, except if it occurs at the beginning or the end of the string. Those exceptions are no real challenge, as

    gen wanted = strpos(" " + strings + " ", " th ") > 0  
    

    works around them. Otherwise, there is a rich set of regular expression functions to play with.
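
    As a hedged sketch of the regular expression route, assuming Stata 14 or later (for the Unicode regex function ustrregexm) and the sample data from the question already in memory. The variable names flag_space and flag_bound are made up for illustration:

    ```stata
    * "th" at the start of the string or after a space,
    * and followed by a space or the end of the string
    gen byte flag_space = ustrregexm(strings, "(^| )th( |$)")

    * looser alternative: \b is a word boundary, so "th," or "(th)"
    * would count as the word too, unlike the space-based definition
    gen byte flag_bound = ustrregexm(strings, "\bth\b")

    list id strings flag_space flag_bound
    ```

    On the four sample observations both flags should come out as 1, 0, 1, 1. The two patterns part ways only when punctuation touches the word, so which one fits depends on how strictly "separated by a space" is meant.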

    As the example above shows, the code that doesn't do what you want also condenses to one line,

    gen th_pos = strpos(strings, "th") > 0
    

    A more direct answer is that you don't have to replace anything. You just have to get Stata to tell you what would happen if you did:

    gen WANTED = strings != subinword(strings, "th", "", .)
    

    If removing the word changes the string, the word must have been present.
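
    For anyone who wants to check the two approaches against each other, here is a minimal sketch that rebuilds the sample data from the question. The variable names by_pad and by_subinword are made up for illustration:

    ```stata
    clear
    input id str50 strings
    1 "the thin th man"
    2 "this old then"
    3 "th to moon"
    4 "moon blank th"
    end

    * pad with spaces so a leading or trailing word is also space-delimited
    gen byte by_pad = strpos(" " + strings + " ", " th ") > 0

    * compare each string with a copy from which the word "th" was removed
    gen byte by_subinword = strings != subinword(strings, "th", "", .)

    * stops with an error if the two flags ever disagree
    assert by_pad == by_subinword

    list id strings by_pad by_subinword
    ```

    The assert line is only a safety check; drop it once you are satisfied that the flags agree on your own data.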