I am trying to web scrape this website with Laravel: https://datacvr.virk.dk/soegeresultater?sideIndex=0&enhedstype=virksomhed&antalAnsatte=ANTAL_20_49&virksomhedsstatus=aktiv%252Cnormal&size=10
With other websites, e.g. Wikipedia, the following code works smoothly. However, on this website it returns an error HTML page that shows the message: "We're sorry but client doesn't work properly without JavaScript enabled. Please enable it to continue." I suppose JavaScript is "not enabled" because I scrape with Symfony's BrowserKit/HttpClient components in Laravel, as shown below.
<?php

namespace App\Http\Controllers;

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

class CompaniesController extends Controller
{
    public function index()
    {
        $client = new HttpBrowser(HttpClient::create());
        $crawler = $client->request('GET', 'https://datacvr.virk.dk/soegeresultater?sideIndex=0&enhedstype=virksomhed&antalAnsatte=ANTAL_20_49&virksomhedsstatus=aktiv%252Cnormal&size=10');

        return $crawler->html();
    }
}
I also tried following this tutorial: https://webmobtuts.com/backend-development/using-laravel-and-symfony-panther-to-scrape-javascript-websites/ where I used the Symfony Panther Client to create a headless Chrome client. You can see my code here:
<?php

namespace App\Http\Controllers;

use Symfony\Component\Panther\Client;

class CompaniesController extends Controller
{
    public function index()
    {
        $client = Client::createChromeClient(); // create a Chrome client
        $crawler = $client->request('GET', 'https://datacvr.virk.dk/soegeresultater?sideIndex=0&enhedstype=virksomhed&antalAnsatte=ANTAL_20_49&virksomhedsstatus=aktiv%252Cnormal&size=10');
        $client->waitFor('div');

        return $crawler->html();
    }
}
However, this returns a similar error: "Enable JavaScript and cookies to continue" within an HTML page.
How can I enable JavaScript when web scraping? Do I need to add something to the request headers, or use a different library?
The issue you're facing is that this website renders its content with JavaScript, and the Symfony BrowserKit component only performs plain HTTP requests; it never executes JavaScript, so it can only ever receive the "please enable JavaScript" shell. Symfony Panther, on the other hand, does drive a real browser, so it is the right tool here, but it needs a working ChromeDriver and a wait condition that matches the rendered content rather than the error page.
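To answer the header question directly: you can pass default headers through HttpClient::create(), but that alone won't fix this page, because no header makes BrowserKit execute JavaScript. A minimal sketch for illustration (the header values are arbitrary examples, not values the site requires):

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

// Extra headers only change what the server sees about the request; the
// response is still the JavaScript-free application shell, because
// HttpBrowser never runs the scripts that build the page.
$client = new HttpBrowser(HttpClient::create([
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64)', // example value
        'Accept-Language' => 'da,en;q=0.9',
    ],
]));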
To scrape websites that heavily depend on JavaScript, you need a headless browser automation tool. In the Node.js world a popular choice is Puppeteer; in PHP the equivalent is Symfony Panther, which controls a real (headless) Chrome through ChromeDriver.
Here's how you can use Panther with Laravel:
1. Install Symfony Panther via Composer:
composer require symfony/panther
2. Use Panther in your controller:
<?php

namespace App\Http\Controllers;

use Symfony\Component\Panther\Client;

class CompaniesController extends Controller
{
    public function index()
    {
        // Start a real headless Chrome instance through ChromeDriver
        $client = Client::createChromeClient();

        $crawler = $client->request('GET', 'https://datacvr.virk.dk/soegeresultater?sideIndex=0&enhedstype=virksomhed&antalAnsatte=ANTAL_20_49&virksomhedsstatus=aktiv%252Cnormal&size=10');

        // Wait for an element that only exists once the JavaScript-rendered
        // results are on the page ('.some-element-class' is a placeholder;
        // use a selector from the real result markup, not a generic 'div')
        $client->waitFor('.some-element-class');

        return $crawler->html();
    }
}
In this example, Client::createChromeClient() starts a headless Chrome through ChromeDriver, so the page's JavaScript actually runs before you read the HTML. Replace '.some-element-class' with a selector that only appears once the search results have rendered; otherwise waitFor() may return before the content you want is on the page.
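Once the page has rendered, you can also use the crawler to pull out structured data instead of returning the raw HTML. A minimal sketch, continuing inside index() after the waitFor() call; '.search-result' and '.company-name' are placeholder selectors, not the site's real markup, so inspect the rendered page to find the actual ones:

// Collect one string per result row (placeholder selectors, see above)
$names = $crawler->filter('.search-result .company-name')->each(
    static fn ($node) => trim($node->text())
);

// Shut down Chrome and ChromeDriver once you are done with the client
$client->quit();

return response()->json($names);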
Panther does not need php artisan serve for this; it only starts a local web server when you are testing your own application. What it does need is a browser driver: make sure ChromeDriver (or geckodriver for Firefox) is installed and on your PATH, or pass the driver's path as the first argument to Client::createChromeClient().
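If you don't have a driver yet, one convenient option (an extra dev dependency, so only if that suits your setup) is the dbrekelmans/bdi package, which downloads a ChromeDriver matching your installed Chrome:

composer require --dev dbrekelmans/bdi
vendor/bin/bdi detect drivers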
Run the code: when you request the route, Panther launches headless Chrome, executes the page's JavaScript, and returns the rendered HTML.
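For completeness, this assumes a route pointing at the controller, for example in routes/web.php:

use App\Http\Controllers\CompaniesController;
use Illuminate\Support\Facades\Route;

Route::get('/companies', [CompaniesController::class, 'index']);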
Note: Be cautious when scraping websites and be sure to review the website's terms of service to ensure compliance. Some websites explicitly prohibit scraping in their terms, and unauthorized scraping may result in legal consequences. Always ensure that your scraping activities are respectful and adhere to applicable laws and regulations.