Search code examples
sas

SAS: How to spot Regex patterns in space separated values and split them into multiple columns


I have a data set that goes like this:

DATA hogwarts;
INFILE DATALINES delimiter="," dsd;
LENGTH Index Wizards $ 255;
INPUT Index Wizards $ Points;
DATALINES;
1,'Harry Potter Ron Weasley',100
2,'Hermione Granger Harry Potter',200
3,'Ron Weasley',300
;
RUN;

In other words it looks like this:

Index Wizards Points
1 Harry Potter Ron Weasley 100
2 Hermione Granger Harry Potter 200
3 Ron Weasley 300

I need to spot regex patterns stored in a wizards data set that looks like this:

Name
Harry Potter
Ron Weasley
Hermione Granger

You can generate it with:

DATA wizards;
INPUT Name $64.;
DATALINES;
Harry Potter
Ron Weasley
Hermione Granger
;
RUN

and output a want data set that looks like this:

Index Wizards Points Wizard1 Wizard2
1 Harry Potter Ron Weasley 100 Harry Potter Ron Weasley
2 Hermione Granger Harry Potter 200 Hermione Granger Harry Potter
3 Ron Weasley 300 Ron Weasley .

The result in Wizard1,Wizard2...WizardN needs to be positional: e.g. at line 2 of hogwarts Hermione Granger pattern appears first in the Wizards column then appears Harry Potter so the value of Wizard1 in want needs to be Hermione Granger.

Also patterns in wizards can be composed of one, two or more than two words, while Wizards is a multival columns in the hogwarts where values are separated by spaces so we cannot count of the fact the number of space between each pattern to be fix.

Plus the number of Wizard1-WizardN columns is not predetermined.

Does someone have a clue?

Many thanks in advance,


Solution

  • You can use regular expressions to detect names on a word boundary in the wizards list. The PRXMATCH function will return the position in the list where a match is made, and that position is used to sort a table of all matches. Finally, TRANSPOSE can pivot your table of matches into the 'across' structure you want.

    Example:

    The data crossing is an outer join, which means every row of hogwarts has some comparison to every row of wizards. In the example the wizards are stored in a temporary array, as are the corresponding regular expression used to test for asserting a positive match.

    DATA hogwarts;
    INFILE DATALINES delimiter="," dsd;
    LENGTH Index 8 Wizards $ 255;
    INPUT Index Wizards $ Points;
    DATALINES;
    1,'Harry Potter Ron Weasley',100
    2,'Hermione Granger Harry Potter',200
    3,'Ron Weasley',300
    ;
    DATA wizards;
    INPUT Name $64.;
    DATALINES;
    Harry Potter
    Ron Weasley
    Hermione Granger
    ;
    
    proc sql;
      select count(name) into :name_count from wizards ;
    data found ;
      array wzname [&name_count] $64 _temporary_ ;
      array rx [&name_count] _temporary_ ;
    
      if 0 then set hogwarts; /* prep PDV */
    
      if _n_ = 1 then do _n_ = 1 by 1 until (end_wizards);
        set wizards end=end_wizards ;
        wzname [_n_] = name ;
        rx[_n_] = prxparse('m/\b'||trim(name)||'\b/i') ;
      end ;
      
      set hogwarts ;
    
      do _n_ = 1 to dim(rx) ;
        pos = prxmatch (rx[_n_],wizards) ;
        if pos then do ;
          name = wzname[_n_] ;
          output ;
        end ;
      end ;
    run ;
    proc sort data=found ;
      by index pos ;
    run ;
    proc transpose data=found out=want(drop=_name_) prefix=wizard;
      by index wizards points ;
      var name ;
    run ;
    

    enter image description here