I have a large dataset with two string variables: people_attending
and special_attendee
:
*Example generated by -dataex-. To install: ssc install dataex
clear
input str148 people_attending str16 special_attendee
"; steve_jobs-apple_CEO; kevin_james-comedian; michael_crabtree-football_player; sharon_stone-actor; bill_gates-microsoft_CEO; kevin_nunes-politician" "michael_crabtree"
"; rob_lowe-actor; ted_cruz-politician; niki_minaj-music_artist; lindsey_whalen-basketball_coach" "niki_minaj"
end
The first variable varies in length and contains a list of every person who attended an event along with their title. Name and title are separated by a dash, and attendees are separated by a semi-colon and space. The second variable is an exact match of one of the names contained in the first variable.
I want to create a third variable that extracts the title for whichever person is listed in the second variable. In the above example, I would want the new variable to be "football_player" for observation 1
and "music_artist" for observation 2
.
Here is a way to do this using a simple regular expression:
generate wanted = subinstr(people_attending, special_attendee, ">", .)
replace wanted = ustrregexs(0) if ustrregexm(wanted, ">(.*?);")
replace wanted = substr(wanted, 3, strpos(wanted, ";")-3)
list wanted
+-----------------+
| wanted |
|-----------------|
1. | football_player |
2. | music_artist |
+-----------------+
In the first step you substitute the name with a marker >
. Then you extract the relevant substring using the regular expression. In the final step, you clean up.
EDIT:
The third step can be omitted if you slightly modify the code as follows:
generate wanted = subinstr(people_attending, special_attendee, ">", .)
replace wanted = ustrregexs(1) if ustrregexm(wanted, ">-(.*?);")