How to use Goutte


Issue:
Cannot fully understand the Goutte web scraper.

Request:
Can someone please help me understand, or provide code that shows, how to use the Goutte web scraper? I have read the README.md, but I am looking for more than it provides: which options Goutte offers and how to write them, and, when working with forms, whether you look up a form by its name= or its id= attribute.

Webpage Layout attempting to be scraped:
Step 1:
The webpage has a form with a radio button that chooses which kind of form to fill out (i.e. Name or License). It defaults to Name, with First Name and Last Name textboxes and a State drop-down select list. If you choose License, jQuery or JavaScript hides the First and Last Name textboxes and shows a License textbox.

Step 2:
Once you have successfully submitted the form, it brings you to a page with multiple links. We can follow either of two of them to get the information we need.

Step 3:
Once we have clicked the link we want, the third page has the data we are looking for, and we want to store that data in a PHP variable.

Submitting Incorrect information:
If wrong information is submitted, jQuery/JavaScript displays the message "No records were found." on the same page as the form.

Note:
The preferred method would be to select the License radio button, fill in the license number, choose the state and then submit the form. I have read many posts and blogs about Goutte, and nowhere can I find which options it offers, how to find that out, or how to use them.


Solution

  • The documentation you want to look at is the Symfony2 DomCrawler.

    Goutte is a client built on top of Guzzle that returns a Crawler every time you request or submit something:

    use Goutte\Client;
    use Symfony\Component\DomCrawler\Crawler; // needed for the Crawler type hint in callbacks
    $client = new Client();
    $crawler = $client->request('GET', 'http://www.symfony-project.org/');
    

    With this crawler you can do stuff like get all the P tags inside the body:

    $nodeValues = $crawler->filter('body > p')->each(function (Crawler $node, $i) {
        return $node->text();
    });
    print_r($nodeValues);
    

    Fill and submit forms:

    $form = $crawler->selectButton('sign in')->form(); 
    $crawler = $client->submit($form, array(
            'username' => 'username', 
            'password' => 'xxxxxx'
    ));
    

    A selectButton() method is available on the Crawler which returns another Crawler that matches a button (input[type=submit], input[type=image], or a button) with the given text. [1]
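On the name= versus id= question: the Form object returned by form() indexes its fields by their name attribute. The sketch below tries this on inline HTML rather than a live site, so it only needs symfony/dom-crawler installed through Composer. The markup, the field names (searchType, licenseNumber, state) and the example.com URLs are hypothetical stand-ins for the page described in the question. Goutte does not execute JavaScript, so a field that a script merely hides is still in the form, while a field that a script creates on the fly would not be.

```php
<?php
// Assumption: `composer require symfony/dom-crawler` has been run here.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for the search page in the question.
$html = <<<'HTML'
<html><body>
<form action="/search" method="post">
  <input type="radio" name="searchType" value="name" checked="checked">
  <input type="radio" name="searchType" value="license">
  <input type="text" name="firstName">
  <input type="text" name="lastName">
  <input type="text" name="licenseNumber">
  <select name="state">
    <option value="AL">Alabama</option>
    <option value="CA">California</option>
  </select>
  <input type="submit" value="Search">
</form>
</body></html>
HTML;

// The second argument is the page URI, used to resolve the form action.
$crawler = new Crawler($html, 'http://example.com/lookup');
$form = $crawler->selectButton('Search')->form();

// Fields are addressed by their name= attribute, not by id=.
$form['searchType']->select('license');       // pick a radio button by value
$form['licenseNumber']->setValue('A1234567'); // fill a text box
$form['state']->select('CA');                 // choose a <select> option by value

$values = $form->getValues(); // name => value pairs that would be submitted
print_r($values);
```

With a real Goutte client, the filled-in form would then be passed to $client->submit($form), which returns the Crawler for the response page.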

    You can also click on links, set options, select check-boxes and more; see the form and link support in the DomCrawler documentation.
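For the page of links in step 2, a minimal sketch of selecting a link by its text might look like the following. The link texts, paths and example.com base URL are made up for illustration, and the HTML is inline so the snippet runs without a network connection.

```php
<?php
// Assumption: `composer require symfony/dom-crawler` has been run here.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical results page with two candidate links.
$html = <<<'HTML'
<html><body>
  <a href="/records/42/summary">Summary</a>
  <a href="/records/42/details">License details</a>
</body></html>
HTML;

// The page URI lets the crawler turn relative hrefs into absolute URLs.
$crawler = new Crawler($html, 'http://example.com/results');

// selectLink() matches on the visible link text; link() wraps the first match.
$link = $crawler->selectLink('License details')->link();
$detailsUrl = $link->getUri();
echo $detailsUrl, PHP_EOL;
```

When the crawler came from a Goutte request, the same $link object can be handed to $client->click($link), which fetches the target page and returns a new Crawler for it.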

    To get data from the crawler, use the html() or text() methods:

    echo $crawler->html();
    echo $crawler->text();
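To store the scraped data in a PHP variable (step 3), one option is to narrow the crawler with filter() and collect the text of each match. The sketch below assumes a hypothetical table with id "record" and needs symfony/css-selector in addition to symfony/dom-crawler, because filter() takes CSS selectors. Calling text() on an empty selection throws an exception, so checking count() first is a reasonable guard when a node may be absent. A message such as "No records were found." will only be visible this way if it is part of the HTML the server returns, not if a script injects it afterwards.

```php
<?php
// Assumption: symfony/dom-crawler and symfony/css-selector are installed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical detail page; real markup will differ.
$html = <<<'HTML'
<html><body>
  <table id="record">
    <tr><th>Name</th><td>Jane Doe</td></tr>
    <tr><th>Status</th><td>Active</td></tr>
  </table>
</body></html>
HTML;

$crawler = new Crawler($html);

// Build a label => value array from each table row.
$record = [];
$crawler->filter('#record tr')->each(function (Crawler $row) use (&$record) {
    $label = trim($row->filter('th')->text());
    $record[$label] = trim($row->filter('td')->text());
});
print_r($record);

// Guard against missing nodes before calling text() on them.
$found = $crawler->filter('#no-such-id')->count() > 0;
```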