I have some data with strings, and I want to flag when a word is found. A word is defined as text at the start of the string, at the end of the string, or separated by spaces. strpos() will find the text wherever it is present, but I am looking for something similar to subinword(). Does Stata have a way to use the functionality of subinword() without having to replace anything, and instead just flag the word?
clear
input id str50 strings
1 "the thin th man"
2 "this old then"
3 "th to moon"
4 "moon blank th"
end
gen th_pos = 0
replace th_pos = 1 if strpos(strings, "th") > 0
The code above flags every observation, as they all contain "th", but my desired output is:
ID strings th_sub
1 "the thin th man" 1
2 "this old then" 0
3 "th to moon" 1
4 "moon blank th" 1
A small trick is that "th" as a word will be preceded and followed by a space, except if it occurs at the beginning or the end of the string. The exceptions are no challenge really, as
gen wanted = strpos(" " + strings + " ", " th ") > 0
works around them. Otherwise, there is a rich set of regular expression functions to play with.
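As one sketch of the regular-expression route, the pattern below asks for "th" bounded on each side by either a space or an end of the string. It assumes Stata 14 or later, where ustrregexm() is available; the variable name wanted_re is only illustrative.

```stata
* Sketch: flag "th" only when it stands alone as a space-delimited word
* (^| ) = start of string or a space;  ( |$) = a space or end of string
* wanted_re is an illustrative name; ustrregexm() requires Stata 14+
gen wanted_re = ustrregexm(strings, "(^| )th( |$)")
```

The alternation is spelled out rather than using a word-boundary anchor because a word boundary would also treat punctuation as a separator, which is a looser definition than the space-delimited one in the question.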
Incidentally, the code in the question that doesn't do what you want condenses to one line:
gen th_pos = strpos(strings, "th") > 0
A more direct answer is that you don't have to replace anything. You just have to get Stata to tell you what would happen if you did:
gen WANTED = strings != subinword(strings, "th", "", .)
If removing every occurrence of the word changes the string, the word must have been present.
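A quick way to check that the padded strpos() approach and the subinword() approach agree is to run both on the example data from the question. This is only a sketch; flag_pad and flag_word are illustrative variable names.

```stata
clear
input id str50 strings
1 "the thin th man"
2 "this old then"
3 "th to moon"
4 "moon blank th"
end

* Approach 1: pad with spaces so that "th" at either end is also space-delimited
gen byte flag_pad  = strpos(" " + strings + " ", " th ") > 0

* Approach 2: the string changes only if subinword() found the word to remove
gen byte flag_word = strings != subinword(strings, "th", "", .)

* Stop with an error if the two flags ever disagree
assert flag_pad == flag_word
list id strings flag_pad flag_word
```

On these four observations both flags should come out as 1, 0, 1, 1, matching the desired output in the question.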