I have a data set that goes like this:
DATA hogwarts;
INFILE DATALINES delimiter="," dsd;
LENGTH Index Wizards $ 255;
INPUT Index Wizards $ Points;
DATALINES;
1,'Harry Potter Ron Weasley',100
2,'Hermione Granger Harry Potter',200
3,'Ron Weasley',300
;
RUN;
In other words it looks like this:
Index | Wizards | Points |
---|---|---|
1 | Harry Potter Ron Weasley | 100 |
2 | Hermione Granger Harry Potter | 200 |
3 | Ron Weasley | 300 |
I need to spot regex patterns stored in a wizards
data set that looks like this:
Name |
---|
Harry Potter |
Ron Weasley |
Hermione Granger |
You can generate it with:
DATA wizards;
INPUT Name $64.;
DATALINES;
Harry Potter
Ron Weasley
Hermione Granger
;
RUN
and output a want
data set that looks like this:
Index | Wizards | Points | Wizard1 | Wizard2 |
---|---|---|---|---|
1 | Harry Potter Ron Weasley | 100 | Harry Potter | Ron Weasley |
2 | Hermione Granger Harry Potter | 200 | Hermione Granger | Harry Potter |
3 | Ron Weasley | 300 | Ron Weasley | . |
The result in Wizard1
,Wizard2
...WizardN
needs to be positional: e.g. at line 2 of hogwarts
Hermione Granger pattern appears first in the Wizards
column then appears Harry Potter so the value of Wizard1
in want
needs to be Hermione Granger.
Also patterns in wizards
can be composed of one, two or more than two words, while Wizards
is a multival columns in the hogwarts
where values are separated by spaces so we cannot count of the fact the number of space between each pattern to be fix.
Plus the number of Wizard1-WizardN
columns is not predetermined.
Does someone have a clue?
Many thanks in advance,
You can use regular expressions to detect names on a word boundary in the wizards list. The PRXMATCH function will return the position in the list where a match is made, and that position is used to sort a table of all matches. Finally, TRANSPOSE can pivot your table of matches into the 'across' structure you want.
Example:
The data crossing is an outer join, which means every row of hogwarts has some comparison to every row of wizards. In the example the wizards are stored in a temporary array, as are the corresponding regular expression used to test for asserting a positive match.
DATA hogwarts;
INFILE DATALINES delimiter="," dsd;
LENGTH Index 8 Wizards $ 255;
INPUT Index Wizards $ Points;
DATALINES;
1,'Harry Potter Ron Weasley',100
2,'Hermione Granger Harry Potter',200
3,'Ron Weasley',300
;
DATA wizards;
INPUT Name $64.;
DATALINES;
Harry Potter
Ron Weasley
Hermione Granger
;
proc sql;
select count(name) into :name_count from wizards ;
data found ;
array wzname [&name_count] $64 _temporary_ ;
array rx [&name_count] _temporary_ ;
if 0 then set hogwarts; /* prep PDV */
if _n_ = 1 then do _n_ = 1 by 1 until (end_wizards);
set wizards end=end_wizards ;
wzname [_n_] = name ;
rx[_n_] = prxparse('m/\b'||trim(name)||'\b/i') ;
end ;
set hogwarts ;
do _n_ = 1 to dim(rx) ;
pos = prxmatch (rx[_n_],wizards) ;
if pos then do ;
name = wzname[_n_] ;
output ;
end ;
end ;
run ;
proc sort data=found ;
by index pos ;
run ;
proc transpose data=found out=want(drop=_name_) prefix=wizard;
by index wizards points ;
var name ;
run ;