I want to generate a dummy variable which is 1 if there is any match in two variables. These two variables are generated by egen concat
and each contains a group of languages used in a country.
For example, var1
has values of apc apc apc apc
, and var2
has values of apc
or var1
is apc fra nya
and var2
is apc
. In either cases, fndmtch2
or egen anymatch
would not give me 1. Is there anyway I can get 1 for each case?
Your data example can be simplified to
sysuse auto
egen var1 = concat(mpg foreign), punct(" ")
egen var2 = concat(trunk foreign), punct(" ")
as mapping to string in this instance is not needed for mpg trunk
any more than it was needed for foreign
. concat()
maps to string on the fly, and the only issues with numeric variables (neither applying here) are if fractional parts are present or you want to see value labels.
Now that it is confirmed that multiple words can be present, we can work with a slightly more interesting example.
Here are two methods. One is to loop over the words in one variable and also the words in the other variable to check if there are any matches.
Stata's definition of a word here is that words are delimited by spaces. That being so, we can check for the occurrence of " word "
within " variable "
, where the leading and trailing spaces are needed because in say "frog toad newt"
neither "frog"
nor "newt"
occurs with both leading and trailing spaces. In the OP's example the check may not be needed, but it often is, just as a search for "1"
or "2"
or "3"
finds any of those within "11 12 13"
, which is wrong if you seek any as a word and not as a single character.
More is said on search for words within strings in a paper in press at the Stata Journal and likely to appear in 22(4) 2022.
* Example generated by -dataex-. For more info, type help dataex
clear
input str8 var1 str5 var2
"FR DE" "FR"
"FR DE GB" "GB"
"GB" "FR"
"IT FR" "GB DE"
end
gen wc = wordcount(var1)
su wc, meanonly
local max1 = r(max)
replace wc = wordcount(var2)
su wc, meanonly
local max2 = r(max)
drop wc
gen match = 0
quietly forval i = 1/`max1' {
forval j = 1/`max2' {
replace match = 1 if word(var1, `i') == word(var2, `j') & word(var1, `i') != ""
}
}
gen MATCH = 0
forval i = 1/`max1' {
replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ")
}
list
+----------------------------------+
| var1 var2 match MATCH |
|----------------------------------|
1. | FR DE FR 1 1 |
2. | FR DE GB GB 1 1 |
3. | GB FR 0 0 |
4. | IT FR GB DE 0 0 |
+----------------------------------+
EDIT
replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ") & !missing(var1, var2)
is better code to avoid the uninteresting match of " "
with " "
.