I want to compare two columns that have people names. I realized that many names differ mainly because they have words like "DA", "DO", "DE" "DAS" "DOS" (they are Portuguese names). I want to create new variables without these words.
So, for example, I have the following dataset:
data have;
input NAMES $50. ;
datalines;
MARIA DO CARMO SILVA
FRANCISCO DE ASSIS DETERO
MARIA DAS DORES ENTADAS
ANTONIO CALCADOS DOS SANTOS
JULIA SILVA DA SILVA GOMEDA
FELIDA DA DAMADOS DOS DOSOMEDE DE DEIXADA
;
And I want to obtain the following:
data want;
input NAMES $50. ;
datalines;
MARIA CARMO SILVA
FRANCISCO ASSIS DETERO
MARIA DORES ENTADAS
ANTONIO CALCADOS SANTOS
JULIA SILVA SILVA GOMEDA
FELIDA DAMADOS DOSOMEDE DEIXADA
;
What have I tried? Well, I tried the following, but it didnt work:
data want;
set have;
NEW_NAMES = substr(compress(NAMES, ' da do de dos das '), 2);
RUN;
Regex is the best way to go about this. You can nest a bunch of transtrn
functions together, but it wouldn't be as elegant. prxchange
will convert matching words into another word with the format s/regex here/new word here/regex options
.
We need to use the regex \b(da | do | de | dos |das )\b
. Test it out here.
prxchange('s/\b(da |do |de |dos |das )\b//i', -1, names)
says to convert the words da
, do
, de
, dos
, and das
into an empty string using case-insensitive matching (/i
option). We add a space to each word so we do not introduce double-spaces into the final string by removing the word, but this could be solved with the compbl
function if we did not do this. We replace all matches in the string with the -1
argument.
data want;
set have;
names = prxchange('s/\b(da |do |de |dos |das )\b//i', -1, names);
run;
NAMES
MARIA CARMO SILVA
FRANCISCO ASSIS DETERO
MARIA DORES ENTADAS
ANTONIO CALCADOS SANTOS
JULIA SILVA SILVA GOMEDA
FELIDA DAMADOS DOSOMEDE DEIXADA