stata string-matching

Generating dummy variable based on two string variables


I want to generate a dummy variable that is 1 if two variables have any word in common. Both variables were generated by egen concat, and each contains a group of languages used in a country. For example, var1 is apc apc apc apc and var2 is apc, or var1 is apc fra nya and var2 is apc. In either case, neither fndmtch2 nor egen anymatch gives me 1. Is there any way I can get 1 in each case?


Solution

  • Your data example can be simplified to

    sysuse auto 
    egen var1 = concat(mpg foreign), punct(" ") 
    egen var2 = concat(trunk foreign), punct(" ") 
    

    as mapping to string in this instance is not needed for mpg trunk any more than it was needed for foreign. concat() maps to string on the fly, and the only issues with numeric variables (neither applying here) are if fractional parts are present or you want to see value labels.
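
    For the two cases just mentioned, here is a minimal sketch using the format() and decode options of concat(). The names s1 and s2 are made up for illustration and are not part of the original question.

    ```stata
    sysuse auto, clear
    * format() controls how numeric values (here with fractional parts) are shown
    egen s1 = concat(gear_ratio foreign), punct(" ") format(%4.2f)
    * decode shows value labels where a variable has them
    egen s2 = concat(mpg foreign), punct(" ") decode
    list s1 s2 in 1/3
    ```

    Neither option is needed in the example above, where all the values are integers and the labels are not wanted.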

    Now that it is confirmed that multiple words can be present, we can work with a slightly more interesting example.

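    Here are two methods. One is to loop over the words in one variable and also over the words in the other variable, checking for any matches. The other is to loop over the words in one variable only and to search for each as a word within the other variable.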

    Stata's definition of a word here is that words are delimited by spaces. That being so, we can check for the occurrence of " word " within " variable ", where the leading and trailing spaces are needed because in, say, "frog toad newt" neither "frog" nor "newt" occurs with both leading and trailing spaces. In the OP's example the check may not be needed, but it often is: a search for "1" or "2" or "3" finds each of those within "11 12 13", which is wrong if you seek each as a word and not as a single character.
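
    The difference can be seen in a small sketch. The variable names s, naive and asword are made up for illustration.

    ```stata
    clear
    input str14 s
    "11 12 13"
    "1 2 3"
    "frog toad newt"
    end

    * substring search: finds "1" inside "11", "12" and "13"
    gen byte naive = strpos(s, "1") > 0
    * word search: pad both the string and the word with spaces
    gen byte asword = strpos(" " + s + " ", " 1 ") > 0
    list
    ```

    Only the second observation contains 1 as a word, and only asword reports that correctly.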

    More is said on searching for words within strings in a paper in press at the Stata Journal and likely to appear in 22(4) 2022.

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str8 var1 str5 var2
    "FR DE"    "FR"
    "FR DE GB" "GB"
    "GB"       "FR"
    "IT FR"    "GB DE"
    end
    
    gen wc = wordcount(var1)
    su wc, meanonly 
    local max1 = r(max)
    replace wc = wordcount(var2)
    su wc, meanonly 
    local max2 = r(max)
    drop wc 
    
    gen match = 0 
    
    quietly forval i = 1/`max1' { 
        forval j = 1/`max2' { 
            replace match = 1 if word(var1, `i') == word(var2, `j') & word(var1, `i') != "" 
        }
    }
    
    gen MATCH = 0 
    
    forval i = 1/`max1' { 
        replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ") 
    }
    
    list 
    
         +----------------------------------+
         |     var1    var2   match   MATCH |
         |----------------------------------|
      1. |    FR DE      FR       1       1 |
      2. | FR DE GB      GB       1       1 |
      3. |       GB      FR       0       0 |
      4. |    IT FR   GB DE       0       0 |
         +----------------------------------+
    

    EDIT

    replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ") & !missing(var1, var2)
    

    is better code for the body of the second loop, as it avoids the uninteresting match of blanks with blanks, which can arise when a variable is empty.
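
    As a check, here is a sketch of the second method applied to the values quoted in the question. The third and fourth observations are made up for illustration: one has no language in common and one has both variables empty.

    ```stata
    clear
    input str15 var1 str3 var2
    "apc apc apc apc" "apc"
    "apc fra nya"     "apc"
    "fra nya"         "apc"
    ""                ""
    end

    * the largest number of words in var1 sets the number of loops
    gen wc = wordcount(var1)
    su wc, meanonly
    local max1 = r(max)
    drop wc

    gen byte MATCH = 0
    quietly forval i = 1/`max1' {
        replace MATCH = 1 if strpos(" " + var2 + " ", " " + word(var1, `i') + " ") & !missing(var1, var2)
    }
    list
    ```

    The first two observations should get 1 and the last two 0. Without the !missing() condition, the observation with both variables empty would be flagged as a match.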