Hi, I am an R user and this is my first time trying to parse HTML data in SAS. I was able to save the page to a text file and read that file using the code below, but I cannot parse the data:
filename src "D:\testwebpage.txt";
proc http
method="GET"
url="xxxxx/yyyyyy"
out=src;
run;
data rep;
infile src length=len lrecl=32767;
input line $varying32767. len;
line = strip(line);
if len>0;
run;
The data in "rep" look like:
<html><body style='font-family:arial'><style type="text/css">tr.head {
background-color: #FFFFFF;
font-weight: bold;
}
tr.even {background-color: #EEEEEE}
tr.odd {background-color: #FFFFFF}</style><table><tr class="head"><td>station_no</td><td>ts_path</td><td>parametertype_name</td></tr>
<tr class="even"><td>23349</td><td>17/23349/path1</td><td>WL</td></tr>
<tr class="odd"><td>23349</td><td>17/23349/path2</td><td>WL</td></tr>
<tr class="even"><td>23349</td><td>17/23349/path3</td><td>WL</td></tr>
<tr class="odd"><td>23349</td><td>17/23349/path4</td><td>WL</td></tr>
<tr><th colspan="3"><img src="images/path.gif" align="right"/>
</th></tr>
</table>
</body></html>
I need to parse "rep" and get a dataset with station_no (23349 in this case), ts_path (17/23349/path1, ...), and parametertype_name (WL). Could someone please help me do this? Like I said, I don't use SAS and know very little about it.
Thanks.
HTML can be a little tricky to parse depending on what you're trying to read. I only have a basic understanding of HTML, but I could identify some patterns to get it read in. This is the way I approached it: flag the table header by finding the line that contains station_no, strip the HTML tags from each line, and then pull the three values out of each row under the header with scan().
If you want to see how the logic works, remove the keep and if() statements to output everything line-by-line.
data rep;
infile src length=len lrecl=32767;
input line $varying32767. len;
line = strip(line);
/* Perl regular expression that replaces each HTML tag with a space.
compbl() then collapses multiple spaces into one space
*/
line_notags = compbl(prxchange('s/<[^>]*>/ /', -1, line));
if(len>0);
/* Do not reset these values at the start of each row */
retain flag_table_header lag_flag_table_header;
/* Set a flag that we've encountered the table header */
if(index(line, 'station_no')) then flag_table_header = 1;
/* Check if we are currently under the table header */
lag_flag_table_header = lag(flag_table_header);
/* If we're under the table header and the line isn't missing after removing tags,
grab what we need and output. Do not output if anything else is missing.
*/
if(lag_flag_table_header AND NOT missing(line_notags) ) then do;
station_no = scan(line_notags, 1, ' ');
ts_path = scan(line_notags, 2, ' ');
parametertype_name = scan(line_notags, 3, ' ');
output;
end;
keep station_no ts_path parametertype_name;
run;
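To check the tag-stripping and scan() steps in isolation, here is a minimal sketch that applies them to a single hard-coded row copied from the sample data. The variable lengths ($200) are my own assumption and are only there to keep the example self-contained.

```sas
data _null_;
  /* Lengths are an assumption for this sketch; adjust to your data */
  length line line_notags $200;

  /* One data row copied from the sample page */
  line = '<tr class="even"><td>23349</td><td>17/23349/path1</td><td>WL</td></tr>';

  /* Replace every tag with a space, then collapse repeated spaces */
  line_notags = compbl(prxchange('s/<[^>]*>/ /', -1, line));

  /* The remaining values are space-delimited, so scan() can pick them out */
  station_no         = scan(line_notags, 1, ' ');
  ts_path            = scan(line_notags, 2, ' ');
  parametertype_name = scan(line_notags, 3, ' ');

  /* Write the results to the log */
  put station_no= ts_path= parametertype_name=;
run;
```

The log should show the three values (23349, 17/23349/path1 and WL) as separate variables. Note that this only works because none of the values contain spaces; a value with embedded spaces would need a different delimiter strategy.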
The biggest thing to remember about the SAS data step language is that it loops implicitly. The program you write runs once for each row that is read. Each time execution reaches the run statement, SAS goes back to the top of the step for the next row and resets the columns you created with input or assignment statements to missing. You can make a column persist between reads with a retain statement.
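As a rough illustration of that reset behavior, the sketch below reads three made-up rows and compares a retained column with an ordinary one. The dataset and column names (demo, running_total, not_retained) are hypothetical and not part of the original problem.

```sas
data demo;
  input x;

  /* running_total keeps its value from one row to the next, starting at 0 */
  retain running_total 0;
  running_total = running_total + x;

  /* not_retained is only assigned on the second row, so it is
     reset to missing again when the third row is read */
  if x = 2 then not_retained = 99;

  /* Show each row's values in the log */
  put x= running_total= not_retained=;
datalines;
1
2
3
;
run;
```

In the log, running_total accumulates to 1, 3 and 6, while not_retained is 99 only on the second row and missing (.) on the others. That is the same mechanism that lets flag_table_header stay set once the header line has been seen.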
Familiarize yourself with the Program Data Vector (PDV) concept. It will really help you understand the data step language and how it processes things. There's a really great communities post here on it. Once you've mastered it, you can bounce between SQL, data steps, and procs to write flexible programs that can process monstrous datasets.