
Guzzle and DomCrawler


I'm using Guzzle and DomCrawler to scrape data from a webpage. Everything's working well except for one issue: it's inserting weird characters into the data that I scrape. Here's an example:

    [2]=>
array(4) {
  ["cell_lines"]=>
  string(4) "A549"
  ["cancer"]=>
  string(4) "Lung"
  ["ic50"]=>
  string(7) ">40 ┬ÁM"
  ["pmid"]=>
  string(8) "10380632"
}
[3]=>
array(4) {
  ["cell_lines"]=>
  string(16) "B16 melanoma 4A5"
  ["cancer"]=>
  string(4) "Skin"
  ["ic50"]=>
  string(7) ">40 ┬ÁM"
  ["pmid"]=>
  string(8) "10380632"
}
[4]=>
array(4) {
  ["cell_lines"]=>
  string(9) "TGBC11TKB"
  ["cancer"]=>
  string(7) "Stomach"
  ["ic50"]=>
  string(7) ">40 ┬ÁM"
  ["pmid"]=>
  string(8) "10380632"
}

The ic50 value comes out as >40 ┬ÁM.

The value that's supposed to be there is >40 µM.

But it's not just Greek letters it's doing this to. Here's another example:

  ["properties"]=>
array(6) {
["logp"]=>
string(5) "á2.85"
["vdw_volume"]=>
string(8) " 239.67"
["polar_surface_area"]=>
string(7) " 75.99"
["refractivity"]=>
string(8) " 363.43"
["mass"]=>
string(9) " 284.068"
["formula"]=>
string(10) " C16H12O5"

As far as I can see, there are only &nbsp; spacers before these numeric values. It's converting every one of them into ┬á for some reason. If I wrap everything in utf8_decode($crawler->text()), here's what I get:

["properties"]=>
array(6) {
["logp"]=>
string(5) "?2.85"
["vdw_volume"]=>
string(7) "á239.67"
["polar_surface_area"]=>
string(6) "á75.99"
["refractivity"]=>
string(7) "á363.43"
["mass"]=>
string(8) "á284.068"
["formula"]=>
string(9) "áC16H12O5"

So all that changes is that I get á instead of ┬Á.

I have tried creating the Crawler instance like this:

$crawler = new Crawler('','http://crdd.osdd.net/raghava/npact/');
$crawler->addHTMLContent($raw, 'UTF-8');

It changes nothing. I tried adding this header to the top of the file:

header('Content-Type: text/html; charset=utf8;');

It had no effect.

Here's how I'm opening the Guzzle client:

$client = new Client(array(
'base_uri' => 'http://crdd.osdd.net/'

));

$response = $client->request('GET','raghava/npact/brws_alp.php?b=A');

I tried installing the ForceCharsetPlugin, which I found here:

https://gist.github.com/pschultz/6554265#file-forcecharsetplugin-php

and implemented it like this:

// create http client instance
$client = new Client(array(
    'base_uri' => 'http://crdd.osdd.net'
));

$plugin = new ForceCharsetPlugin();
$plugin->setForcedCharset('utf8');
// Guzzle only
$client->addSubscriber($plugin);

and I get this error:

Fatal error: Uncaught exception 'InvalidArgumentException' with message 'URI must be a string or UriInterface' in C:\wamp64\www\spider\osdd\vendor\guzzlehttp\psr7\src\functions.php:62
Stack trace:
#0 C:\wamp64\www\spider\osdd\vendor\guzzlehttp\guzzle\src\Client.php(142): GuzzleHttp\Psr7\uri_for(Object(ForceCharsetPlugin))
#1 C:\wamp64\www\spider\osdd\vendor\guzzlehttp\guzzle\src\Client.php(115): GuzzleHttp\Client->buildUri(Object(ForceCharsetPlugin), Array)
#2 C:\wamp64\www\spider\osdd\vendor\guzzlehttp\guzzle\src\Client.php(129): GuzzleHttp\Client->requestAsync('addSubscriber', Object(ForceCharsetPlugin), Array)
#3 C:\wamp64\www\spider\osdd\vendor\guzzlehttp\guzzle\src\Client.php(87): GuzzleHttp\Client->request('addSubscriber', Object(ForceCharsetPlugin), Array)
#4 C:\wamp64\www\spider\osdd\osdd_data.php(185): GuzzleHttp\Client->__call('addSubscriber', Array)
#5 C:\wamp64\www\spider\osdd\osdd_data.php(185): GuzzleHttp\Client->addSubscriber(Object(ForceCharsetPlugin))
#6 {main}
  thrown in C:\wamp64\www\spider\osdd\vendor\guzzlehttp\psr7\src\functions.php on line 62

Does anyone know what's going on here, and why Guzzle/DomCrawler is converting things into these weird characters?

BTW: Here's my composer.json file, which I autoload to include the components:

{
"require": {
  "symfony/dom-crawler": "~3.0",
  "symfony/css-selector": "~3.0",
  "guzzlehttp/guzzle": "~6.2.2",
  "fabpot/goutte": "*",
  "symfony/process": "*",
  "symfony/var-dump": "*"
  }
}

I wonder whether ForceCharsetPlugin isn't working because I'm including older versions of some of the components it uses. I haven't fully figured out how version constraints work, and I don't know what the * wildcard does.
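On the plugin error: the stack trace shows Guzzle 6 routing the unknown method addSubscriber through Client::__call and treating it as an HTTP verb, which suggests the gist targets an older Guzzle API. Guzzle 6 uses a handler stack with middleware instead. The following is only a sketch of how a forced charset might look there; the middleware name 'force_charset' and the choice to overwrite the Content-Type header are assumptions for illustration, not a port of the gist, and it would not change how a terminal displays the bytes.

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\ResponseInterface;

// Start from Guzzle's default handler stack.
$stack = HandlerStack::create();

// Rewrite the Content-Type header of every response so that code reading
// the header sees UTF-8. 'force_charset' is a hypothetical label.
$stack->push(Middleware::mapResponse(function (ResponseInterface $response) {
    return $response->withHeader('Content-Type', 'text/html; charset=UTF-8');
}), 'force_charset');

$client = new Client([
    'base_uri' => 'http://crdd.osdd.net/',
    'handler'  => $stack,
]);
```

Requests made through $client, such as $client->request('GET', 'raghava/npact/brws_alp.php?b=A'), then return responses carrying the rewritten header. As for the constraints in composer.json: * accepts any version of a package, while ~6.2.2 accepts 6.2.2 and later patch releases below 6.3.0.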


Solution

  • Sorry, I found out that this problem was only occurring when running the script through the CLI. When I opened it in a browser, the encoding was fine: https://i.gyazo.com/f488c8a3cbe25cae5c1b368b992b1c53.png
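
A likely explanation for the CLI-only behaviour, assuming the Windows console was using the OEM code page CP850 (an assumption; the question does not say): the scraped strings were valid UTF-8 all along, and the console drew each UTF-8 byte as a separate CP850 character. The sketch below checks this by reinterpreting the bytes; the variable names are made up for illustration.

```php
<?php
// ">40 µM" and a non-breaking space, written as explicit UTF-8 bytes.
$ic50 = ">40 \xC2\xB5M";
$nbsp = "\xC2\xA0";

// Seven bytes, matching the string(7) reported by var_dump().
echo strlen($ic50), ' ', bin2hex($ic50), "\n"; // 7 3e343020c2b54d

// Decode the same bytes as CP850 (assumed console code page) to
// reproduce what the terminal draws.
$shownIc50 = iconv('CP850', 'UTF-8', $ic50);   // ">40 ┬ÁM"
$shownNbsp = iconv('CP850', 'UTF-8', $nbsp);   // "┬á"

// utf8_decode() converts to ISO-8859-1, leaving a single byte (0xA0)
// that CP850 draws as "á"; iconv() performs the same conversion here.
$latin1Nbsp      = iconv('UTF-8', 'ISO-8859-1', $nbsp);
$shownLatin1Nbsp = iconv('CP850', 'UTF-8', $latin1Nbsp); // "á"

echo $shownIc50, "\n", $shownNbsp, "\n", $shownLatin1Nbsp, "\n";
```

If that assumption holds, nothing in Guzzle or DomCrawler needs changing. Switching the console to UTF-8 with chcp 65001 before running the script, or writing the output to a file and opening it as UTF-8, should show the expected characters.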