
How can I make this PHP script run faster/asynchronously?


I have a Pastebin scraper script designed to find leaked emails and passwords, which I want to use to build a website like HaveIBeenPwned.

Here is what my script is doing:
- Scraping Pastebin links from https://psbdmp.ws/dumps
- Getting a random proxy from this Random Proxy API (because Pastebin bans your IP if you hammer it with too many requests): https://api.getproxylist.com/proxy
- Making a cURL request to each Pastebin link, then running preg_match_all on the response to find all email addresses and passwords in the format email:password.

The script itself seems to work, but it isn't optimized enough: after running for a while it gives me a 524 timeout error, which I suspect is caused by all those cURL requests.

Here is my code:
api.php

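    // Fetch one paste through a random proxy and hand the response to comboScrape().
    function comboScrape_CURL($url) {
        // Get a random proxy; skip this URL if the proxy API returns nothing usable
        $json = file_get_contents("https://api.getproxylist.com/proxy");
        $decoded = ($json === false) ? null : json_decode($json);
        if (!isset($decoded->ip, $decoded->port)) {
            return;
        }
        $proxy = $decoded->ip . ':' . $decoded->port;

        // Crawl with proxy
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_HEADER, 1);
        $curl_scraped_page = curl_exec($ch);
        curl_close($ch);

        // curl_exec() returns false on failure; only parse real responses.
        // comboScrape() is defined elsewhere in my code (not shown here).
        if ($curl_scraped_page !== false) {
            comboScrape('email:pass', $curl_scraped_page);
        }
    }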

index.php

    require('api.php');
    $expression = "/(?:https\:\/\/pastebin\.com\/\w+)/";

    // Page suffixes: '' for the first page, then 1 through 20
    $extension = array_merge([''], range(1, 20));
    foreach ($extension as $pge_number) {
        $dumps = file_get_contents("https://psbdmp.ws/dumps/" . $pge_number);
        preg_match_all($expression, $dumps, $urls);
        $codes = str_replace('https://pastebin.com/', '', $urls[0]);
        foreach ($codes as $code) {
            comboScrape_CURL("https://pastebin.com/raw/" . $code);
        }
    }

Solution

  • On the 524 timeout error: it seems you're running PHP behind a web server (Apache? nginx? lighttpd? IIS?). Don't do that; run your code from php-cli instead. php-cli has no execution time limit by default, so the script can run indefinitely without timing out.

    On Pastebin banning your IP when you send too many requests: buy a pastebin.com Pro account instead (https://pastebin.com/pro). It costs about $50 (or about $20 around Christmas and Black Friday), it is a lifetime account with a one-time payment, and it gives you access to the scraping API (https://pastebin.com/doc_scraping_api). With the scraping API you can fetch about 1 paste per second, or 86,400 pastes per day, without getting IP-banned.

    Because of pastebin.com's rate limits, there is no need to do this asynchronously with multiple connections. It's possible, but not worth the hassle; if you actually needed to do it, you'd have to use the curl_multi API.
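To make the php-cli advice concrete, here is a minimal sketch of a guard you could put at the top of index.php so the scraper refuses to run behind a web server. The helper name runningFromCli() is made up for this example; PHP_SAPI and set_time_limit() are standard PHP.

```php
<?php
// Hypothetical helper (not from the question): true when PHP runs as php-cli.
function runningFromCli(): bool
{
    return PHP_SAPI === 'cli';
}

if (!runningFromCli()) {
    // Behind a web server, the server or a proxy in front of it can still
    // cut the request off, no matter what PHP's own limits are.
    http_response_code(403);
    exit("Run this script from the command line: php index.php\n");
}

// php-cli already defaults max_execution_time to 0 (no limit);
// calling this just makes the intent explicit.
set_time_limit(0);
```

You would then start the scraper with `php index.php` from a shell (or from cron) rather than by opening it in a browser.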
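If you go the scraping API route, staying at roughly one paste per second only needs a sequential loop with a pause. The sketch below is one way to do that, not the only one: fetchPaced() and its parameters are names invented for this example, and the actual download is passed in as a callable so the pacing logic does not depend on any particular endpoint.

```php
<?php
/**
 * Hypothetical helper: call $fetch once per key, making sure consecutive
 * calls start at least $minInterval seconds apart.
 * Returns an array of key => whatever $fetch returned for that key.
 */
function fetchPaced(array $keys, callable $fetch, float $minInterval = 1.0): array
{
    $results = [];
    $lastStart = null;
    foreach ($keys as $key) {
        if ($lastStart !== null) {
            // Sleep only for the part of the interval that has not elapsed yet.
            $wait = $minInterval - (microtime(true) - $lastStart);
            if ($wait > 0) {
                usleep((int) round($wait * 1000000));
            }
        }
        $lastStart = microtime(true);
        $results[$key] = $fetch($key);
    }
    return $results;
}
```

For real use, $fetch would be a closure that downloads one paste by its key, for example with file_get_contents() against the raw-item URL given in the scraping API documentation linked above (check that page for the exact endpoint and for the IP whitelisting it requires).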
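For completeness, this is roughly what the curl_multi approach looks like if you ever do need parallel downloads (for sources without such a strict rate limit, say). It is a sketch under assumptions: fetchAll() is a hypothetical name, the timeout values are arbitrary, and proxy handling is left out.

```php
<?php
/**
 * Hypothetical helper: download several URLs concurrently with curl_multi.
 * Returns an array of url => body, with null for transfers that failed.
 */
function fetchAll(array $urls, int $timeout = 15): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach (array_unique($urls) as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_CONNECTTIMEOUT => 5,        // arbitrary example value
            CURLOPT_TIMEOUT        => $timeout,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none of them is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            // Wait for network activity instead of spinning the CPU.
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // curl_multi_info_read() reports the cURL result code of each finished transfer.
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = in_array($ch, $failed, true) ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);
    return $bodies;
}
```

Calling fetchAll() with a batch of URLs returns once every transfer in the batch has finished or timed out. Note that this only detects transport-level failures; an HTTP error page still comes back as a body, so check the status code with curl_getinfo() if that matters to you.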