
Pick out specific text of a text file using regular expressions in SAS


I have data like the following:

DATA test2;
INPUT STRING $31. ;
PUT STRING;
DATALINES;

James Bond is a spy
Hello World
123 Mill st P BOX 223
11 prospect ave p o box

P Box 225
Hello World
pobox 2212

P. O. box. 256
; 
run;

I would like to keep only the lines from each line that starts with "Hello World" up to the next blank line, so that my output would be:

Hello World
123 Mill st P BOX 223
11 prospect ave p o box

Hello World
pobox 2212

My idea is to then manipulate each of these two (or, in general, more) blocks of text and afterwards append them together. But first I need to filter out only the text I need. Note that my original text file is huge, and I do not know in advance where the blank lines occur.

My attempt so far is this:

data test3;
 set test2;
 if _n_=1 then do; 
 retain startline endline;
 startline = prxparse('/Hello World/');
 endline = prxparse('/^\s/');
 end;

 if (prxmatch(startline,STRING)=1 or prxmatch(endline,STRING)=1) ;
 run;

It gives me the following output, but I also need the rest:

[image: output]

EDIT: I should stress that there might be blank lines anywhere in the text, but I only want the information between "Hello World" and the next blank line.


Solution

  • You have to check for the start line and the end line separately, and keep track of whether you are inside a block with a retained flag.

    EDIT: This way only the desired data lines are output. Concatenation has to be done in a separate step.

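    data test3;
     set test2;

     /* compile both patterns once; keep them and the flag across rows */
     retain startline endline start;
     if _n_=1 then do;
        startline = prxparse('/Hello World/');
        endline = prxparse('/^\s/');
     end;

     if prxmatch(endline,STRING) then start = 0;         /* blank line: leave the block */
     else if prxmatch(startline,STRING) then start = 1;  /* "Hello World": enter the block */
     if start then output;                               /* keep only rows inside a block */

     run;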
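
    Since the start marker here is a fixed string, the same flag logic can also be sketched without regular expressions. This is only an illustration under assumptions: the data set name test3_plain is made up, and a line counts as blank when STRING is missing. The `=:` operator compares only the leading characters, so it matches lines that start with "Hello World".

    ```sas
    data test3_plain;   /* hypothetical name, not from the original answer */
     set test2;
     retain start 0;    /* 1 while we are inside a "Hello World" block */

     if missing(STRING) then start = 0;               /* blank line closes the block */
     else if STRING =: 'Hello World' then start = 1;  /* prefix comparison opens it */

     if start;          /* subsetting IF: keep only rows inside a block */
    run;
    ```

    Like the step above, this drops the blank separator lines themselves, so the blocks still have to be told apart in a later step.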
    

    With concatenation:

    data test3;
     set test2;
    
     if _n_=1 then do; 
     retain startline endline start OUTPUT;
     length OUTPUT $3000;
     startline = prxparse('/Hello World/');
     endline = prxparse('/^\s/');
     end;
    
     if prxmatch(endline,STRING) then do; /* check for endline */
        if OUTPUT ne "" then output;       /* write the collected block as one observation */
        start = 0;                         /* always leave the block, even if it was empty */
        OUTPUT = "";
     end;
    
     if start then do;
        /* Add text manipulation here */
        OUTPUT = catx(" ",OUTPUT,STRING); /* concat string */
     end;
    
     if prxmatch(startline,STRING) then start = 1; /* check for startline */
    
     keep output;
    
     run;
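
    One case the step above does not cover: if the input ends while a block is still open (no trailing blank line), the last block is never written. A minimal sketch of one way to handle that, assuming the same test2 data set. The names test4, block and eof are illustrative and not part of the original answer.

    ```sas
    data test4;                     /* hypothetical name */
     set test2 end=eof;             /* eof becomes 1 on the last observation */
     length block $3000;
     retain start 0;
     retain block;                  /* carry the collected text across rows */

     if missing(STRING) then do;    /* blank line: close the current block */
        if block ne "" then output;
        start = 0;
        block = "";
     end;
     else if start then block = catx(" ", block, STRING);  /* collect the line */
     else if STRING =: 'Hello World' then start = 1;        /* open a new block */

     if eof and block ne "" then output;  /* flush a block still open at the end */

     keep block;
    run;
    ```

    As in the step above, the "Hello World" line itself is not included in the concatenated text; only the lines after it are.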