Search code examples
stata

Extract term within a string that matches a variable


I have a large dataset with two string variables: people_attending and special_attendee:

*Example generated by -dataex-. To install: ssc install dataex
clear
input str148 people_attending str16 special_attendee
"; steve_jobs-apple_CEO; kevin_james-comedian; michael_crabtree-football_player; sharon_stone-actor; bill_gates-microsoft_CEO; kevin_nunes-politician" "michael_crabtree"
"; rob_lowe-actor; ted_cruz-politician; niki_minaj-music_artist; lindsey_whalen-basketball_coach"                                                      "niki_minaj"      
end

The first variable varies in length and contains a list of every person who attended an event along with their title. Name and title are separated by a dash, and attendees are separated by a semi-colon and space. The second variable is an exact match of one of the names contained in the first variable.

I want to create a third variable that extracts the title for whichever person is listed in the second variable. In the above example, I would want the new variable to be "football_player" for observation 1 and "music_artist" for observation 2.


Solution

  • Here is a way to do this using a simple regular expression:

    generate wanted = subinstr(people_attending, special_attendee, ">", .)
    replace wanted = ustrregexs(0) if ustrregexm(wanted, ">(.*?);")
    replace wanted = substr(wanted, 3, strpos(wanted, ";")-3)
    
    list wanted
    
         +-----------------+
         |          wanted |
         |-----------------|
      1. | football_player |
      2. |    music_artist |
         +-----------------+
    

    In the first step you substitute the name with a marker >. Then you extract the relevant substring using the regular expression. In the final step, you clean up.


    EDIT:

    The third step can be omitted if you slightly modify the code as follows:

    generate wanted = subinstr(people_attending, special_attendee, ">", .)
    replace wanted = ustrregexs(1) if ustrregexm(wanted, ">-(.*?);")