I want to enter a very long list of urls and search for specific strings within the source code, outputting a list of urls that contain the string. Sounds simple enough right? I have come up with the bellow code, the input being a html form. You can try it at pelican-cement.com/findfrog.
It seems to work half the time, but is thrown off by multiple urls/urls in different orders. Searching for 'adsense' it correctly ids politics1.com out of
cnn.com
politics1.com
however, if reversed the output is blank. How can I get reliable, consistent results? preferably something I could input thousands of urls into?
<html>
<body>
<?
set_time_limit (0);
$urls=explode("\n", $_POST['url']);
$allurls=count($urls);
for ( $counter = 0; $counter <= $allurls; $counter++) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$urls[$counter]);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST,'GET');
curl_setopt ($ch, CURLOPT_HEADER, 1);
curl_exec ($ch);
$curl_scraped_page=curl_exec($ch);
$haystack=strtolower($curl_scraped_page);
$needle=$_POST['proxy'];
if (strlen(strstr($haystack,$needle))>0) {
echo $urls[$counter];
echo "<br/>";
curl_close($ch);
}
}
//$FileNameSQL = "/googleresearch" . abs(rand(0,1000000000000000)) . ".csv";
//$query = "SELECT * FROM happyturtle INTO OUTFILE '$FileNameSQL' FIELDS TERMINATED BY ','";
//$result = mysql_query($query) or die(mysql_error());
//exit;
echo '$FileNameSQL';
?>
</body>
</html>
Reorganized your code a bit. The main culprit was whitespace. You need to trim your URL string before using it (i.e. trim($url);
).
Other changes:
The code below can be run on my quick mockup.
<html>
<body>
<form action="search.php" method="post">
URLs: <br/>
<textarea rows="20" cols="50" input type="text" name="url" /></textarea><br/>
Search Term: <br/>
<textarea rows="20" cols="50" input type="text" name="proxy" /></textarea><br/>
<input type="submit" />
</form>
<?
if(isset($_POST['url'])) {
set_time_limit (0);
$urls = explode("\n", $_POST['url']);
$term = $_POST['proxy'];
$options = array( CURLOPT_FOLLOWLOCATION => 1,
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_CUSTOMREQUEST => 'GET',
CURLOPT_HEADER => 1,
);
$ch = curl_init();
curl_setopt_array($ch, $options);
foreach ($urls as $url) {
curl_setopt($ch, CURLOPT_URL, trim($url));
$html = curl_exec($ch);
if ($html !== FALSE && stristr($html, $term) !== FALSE) { // Found!
echo $url;
}
}
curl_close($ch);
}
?>
</body>
</html>