Good evening, dear community!
I want to process multiple web pages, kind of like a web spider/crawler might. I have some bits working, but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50
Update:
Thanks to two great comments I have gained a lot, and the code now runs very nicely. Last question: how do I store the data in a file, i.e. how do I make the parser write its results into a file instead of printing more than 6000 records to the command line? That would be much more convenient. (A sketch for the file-writing part follows after the sample output below.) And once the output is in a file, I still need to do some final cleanup. See the output: if we compare it with the target URL, it surely needs some cleanup, don't you think? Again, see the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50
6114,7754,"Volksschule Zeil a.Mai",/Sa,"d a.Mai",(Gru,"09524/94992 09524/94997",,Volksschulen,
6115,7757,"Mittelschule Zeil - Sa","d a.Mai",Schulri,"g
97475 Zeil","09524/94995
09524/94997",,Volksschulen," www.hauptschule-zeil-sand.de"
6116,3890,"Volksschule Zeilar",(Gru,"dschule)
Bgm.-Stallbauer-Str. 8
84367 Zeilar",,"08572/439
08572/920001",,Volksschulen," www.gs-zeilarn.de"
6117,4664,"Volksschule Zeitlar",(Gru,"dschule)
Schulstraße 5
93197 Zeitlar",,"0941/63528
0941/68945",,Volksschulen," www.vs-zeitlarn.de"
6118,4818,"Mittelschule Zeitlar","Schulstra�e 5
93197 Zeitlar",,,"0941/63528
0941/68945",,Volksschulen," www.vs-zeitlarn.de"
6119,7684,"Volksschule Zeitlofs (Gru","dschule)
Raiffeise","Str. 36
97799 Zeitlofs",,"09746/347
09746/347",,Volksschulen," grundschule-zeitlofs.de"
Thanks for any and all info! zero
Here is the old question: the code seems to work fine as part of a one-shot function, but as soon as I include the function in a loop, it doesn't return anything... What's the deal?
To begin at the beginning, see the target http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50. This page has more than 6000 results! So how do I get all of them? I use the module LWP::Simple, and I need some improved arguments so that I can fetch all 6150 records. I have code that stems from the very supportive member tadmic (see this forum), and it basically runs very nicely. But after adding some lines it (at the moment) spits out some errors.
Attempt: Here are the first 5 page URLs:
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200
We can see that the "s" parameter in the URL starts at 0 for page 1 and then increases by 50 for each page thereafter. We can use this information to create a loop:
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
# page offsets: s=0, 50, 100, ... 6100
my $i_first    = 0;
my $i_last     = 6100;
my $i_interval = 50;

my $csv = Text::CSV->new({ binary => 1 });

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    my $html = get("http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i");
    defined $html or next;     # skip this page if the request failed

    $html =~ tr/\r//d;         # strip the carriage returns
    $html =~ s/&nbsp;/ /g;     # expand the non-breaking spaces

    my $te = HTML::TableExtract->new();
    $te->parse($html);

    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            # trim leading/trailing whitespace from base fields
            # (cells can be undef, hence the defined guard)
            defined and s/^\s+//, s/\s+$// for @$row;

            # load the fields into the hash using a "hash slice"
            my %h;
            @h{@cols} = @$row;

            # derive some fields from base fields, again using a hash slice
            @h{qw/name street postal town/} = split /\n+/, $h{name}  // '';
            @h{qw/phone fax/}               = split /\n+/, $h{phone} // '';

            # trim leading/trailing whitespace from derived fields
            # (header rows yield fewer parts, so guard against undef here too)
            defined and s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

            $csv->combine(@h{@fields});
            print $csv->string, "\n";
        }
    }
}
I tested the code and got the results below. By the way, the command line reported errors at lines 57 and 58 of my original version of the script, i.e. these two lines:

#trim leading/trailing whitespace from derived fields
s/^s+//, s/\s+$// for @h{qw/name street postal town/};

What do you think? Are there some backslashes missing, i.e. should s/^s+// really be s/^\s+//? How do I fix and test-run the code so that the results come out correctly?
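To illustrate the difference for anyone else reading, here is a tiny standalone test (not part of the script):

#!/usr/bin/perl
use strict;
use warnings;

my $s = "Grundschule\r\n";
(my $a = $s) =~ tr/r//d;    # deletes the letter "r": "Gundschule\r\n"
(my $b = $s) =~ tr/\r//d;   # deletes the carriage return: "Grundschule\n"
print $a, $b;

The same applies to s/^s+// versus s/^\s+// and split /n+/ versus split /\n+/: without the backslash, the literal letters s, r and n get matched, which is exactly why letters are missing from the output above.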
Looking forward to hearing from you! zero
Here are the errors I got:
Ot",,,Telefo,Fax,Schulat,Webseite Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58. Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58. Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58. Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58. "lfd. N.",Schul-numme,Schul,"ame
Sta�e
PLZ
Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
Staße
PLZ
Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
If you're trying to extract links from the pages, use WWW::Mechanize, which is a wrapper around LWP and properly parses the HTML to get the links for you, as well as a zillion other convenience things for people scraping web pages.
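For example, here is a minimal sketch of the Mechanize approach (untested; the URL is the one from the question):

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50');

# print every link found on the page: URL and link text
for my $link ($mech->links) {
    printf "%s\t%s\n", $link->url, $link->text // '';
}

From there, $mech->content gives you the fetched HTML, so the HTML::TableExtract part of the script above keeps working unchanged.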