I'm using Guzzle and DomCrawler to scrape data from a webpage. Everything's working well except for one issue: weird characters are being inserted into the scraped data. Here's an example:
[2]=>
array(4) {
["cell_lines"]=>
string(4) "A549"
["cancer"]=>
string(4) "Lung"
["ic50"]=>
string(7) ">40 ┬ÁM"
["pmid"]=>
string(8) "10380632"
}
[3]=>
array(4) {
["cell_lines"]=>
string(16) "B16 melanoma 4A5"
["cancer"]=>
string(4) "Skin"
["ic50"]=>
string(7) ">40 ┬ÁM"
["pmid"]=>
string(8) "10380632"
}
[4]=>
array(4) {
["cell_lines"]=>
string(9) "TGBC11TKB"
["cancer"]=>
string(7) "Stomach"
["ic50"]=>
string(7) ">40 ┬ÁM"
["pmid"]=>
string(8) "10380632"
}
The ic50 value comes out as >40 ┬ÁM
The value that's supposed to be there is >40 µM
But it's not just Greek letters that this happens to. Here's another example:
["properties"]=>
array(6) {
["logp"]=>
string(5) "á2.85"
["vdw_volume"]=>
string(8) " 239.67"
["polar_surface_area"]=>
string(7) " 75.99"
["refractivity"]=>
string(8) " 363.43"
["mass"]=>
string(9) " 284.068"
["formula"]=>
string(10) " C16H12O5"
As far as I can see, there are only non-breaking space (&nbsp;) spacers before these numeric values, and for some reason those are being converted as well. If I wrap everything in utf8_decode(), as in utf8_decode($crawler->text()), here's what I get:
["properties"]=>
array(6) {
["logp"]=>
string(5) "?2.85"
["vdw_volume"]=>
string(7) "á239.67"
["polar_surface_area"]=>
string(6) "á75.99"
["refractivity"]=>
string(7) "á363.43"
["mass"]=>
string(8) "á284.068"
["formula"]=>
string(9) "áC16H12O5"
So all that changes is that I get á instead of ┬Á.
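The string lengths in the two dumps hint at what utf8_decode() is doing here: it converts UTF-8 to ISO-8859-1, so a two-byte non-breaking space becomes a single byte. Below is a minimal sketch of that effect. The sample value is hypothetical (it stands in for one scraped property), and iconv() is used as a stand-in for utf8_decode(), which performs the same conversion but is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical sample: a non-breaking space (U+00A0, UTF-8 bytes C2 A0)
// followed by a number, standing in for one scraped property value.
$utf8 = "\xC2\xA0239.67";

// Stand-in for utf8_decode(): convert UTF-8 to ISO-8859-1.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

var_dump(strlen($utf8));    // int(8): the space occupies two bytes
var_dump(strlen($latin1));  // int(7): the space is now the single byte A0
var_dump(bin2hex($latin1)); // string(14) "a03233392e3637"
```

The 8 and 7 match the string(8) and string(7) lengths shown for vdw_volume above, which suggests the scraped bytes themselves are valid UTF-8 and only their display differs.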
I have tried creating the Crawler instance like this:
$crawler = new Crawler('','http://crdd.osdd.net/raghava/npact/');
$crawler->addHTMLContent($raw, 'UTF-8');
It changes nothing. I tried adding this header to the top of the file:
header('Content-Type: text/html; charset=utf8;');
It had no effect.
Here's how I'm opening the Guzzle client:
$client = new Client(array(
'base_uri' => 'http://crdd.osdd.net/'
));
$response = $client->request('GET','raghava/npact/brws_alp.php?b=A');
I tried installing the ForceCharsetPlugin, which I found here:
https://gist.github.com/pschultz/6554265#file-forcecharsetplugin-php
and implemented it like this:
// create http client instance
$client = new Client(array(
'base_uri' => 'http://crdd.osdd.net'
));
$plugin = new ForceCharsetPlugin();
$plugin->setForcedCharset('utf8');
// Guzzle only
$client->addSubscriber($plugin);
and I get this error:
Fatal error: Uncaught exception 'InvalidArgumentException' with message 'URI must be a string or UriInterface' in C:\wamp64\www\spider\osdd\vendor\guzzlehttp\psr7\src\functions.php:62
Stack trace:
#0 C:\wamp64\www\spider\osdd\vendor\guzzlehttp\guzzle\src\Client.php(142): GuzzleHttp\Psr7\uri_for(Object(ForceCharsetPlugin))
#1 C:\wamp64\www\spider\osdd\vendor\guzzlehttp\guzzle\src\Client.php(115): GuzzleHttp\Client->buildUri(Object(ForceCharsetPlugin), Array)
#2 C:\wamp64\www\spider\osdd\vendor\guzzlehttp\guzzle\src\Client.php(129): GuzzleHttp\Client->requestAsync('addSubscriber', Object(ForceCharsetPlugin), Array)
#3 C:\wamp64\www\spider\osdd\vendor\guzzlehttp\guzzle\src\Client.php(87): GuzzleHttp\Client->request('addSubscriber', Object(ForceCharsetPlugin), Array)
#4 C:\wamp64\www\spider\osdd\osdd_data.php(185): GuzzleHttp\Client->__call('addSubscriber', Array)
#5 C:\wamp64\www\spider\osdd\osdd_data.php(185): GuzzleHttp\Client->addSubscriber(Object(ForceCharsetPlugin))
#6 {main}
thrown in C:\wamp64\www\spider\osdd\vendor\guzzlehttp\psr7\src\functions.php on line 62
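The trace shows Client::__call() treating 'addSubscriber' as an HTTP method name, which is what one would expect if the installed Guzzle 6 client simply has no addSubscriber() method. In Guzzle 6 the usual place for this kind of hook is a handler-stack middleware. The following is only a rough sketch of that idea, not a port of the gist: forceCharset() and createClientWithForcedCharset() are hypothetical names, and the middleware only relabels the Content-Type header without transcoding the body.

```php
<?php
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\ResponseInterface;

// Hypothetical helper: replace (or append) the charset parameter
// of a Content-Type header value.
function forceCharset($contentType, $charset)
{
    // Drop any existing charset parameter, then append the forced one.
    $base = trim(preg_replace('/;\s*charset=[^;]*/i', '', $contentType), ' ;');
    if ($base === '') {
        $base = 'text/html'; // assumption: default to HTML when no type is sent
    }
    return $base . '; charset=' . $charset;
}

// Hypothetical factory: a Guzzle 6 client whose responses always declare
// $charset. Requires Guzzle to be autoloaded before it is called.
function createClientWithForcedCharset($baseUri, $charset)
{
    $stack = HandlerStack::create();
    $stack->push(Middleware::mapResponse(
        function (ResponseInterface $response) use ($charset) {
            $type = forceCharset($response->getHeaderLine('Content-Type'), $charset);
            return $response->withHeader('Content-Type', $type);
        }
    ));

    return new Client(array('base_uri' => $baseUri, 'handler' => $stack));
}
```

With Composer's autoloader included, a call such as createClientWithForcedCharset('http://crdd.osdd.net/', 'UTF-8') would take the place of the new Client(...) and addSubscriber() lines.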
Does anyone know what's going on here, and why Guzzle/DomCrawler is converting things into these weird characters?
BTW: Here's my composer.json file, which I'm autoloading to include the components:
{
"require": {
"symfony/dom-crawler": "~3.0",
"symfony/css-selector": "~3.0",
"guzzlehttp/guzzle": "~6.2.2",
"fabpot/goutte": "*",
"symfony/process": "*",
"symfony/var-dump": "*"
}
}
I wonder if the reason ForceCharsetPlugin is not working could be that I'm including older versions of some of the components it uses. I haven't fully figured out how version constraints work, and I don't know what the * wildcard does.
Sorry, I found out that this problem was only occurring when running the script through the CLI. When I opened it in a browser, the encoding was fine: https://i.gyazo.com/f488c8a3cbe25cae5c1b368b992b1c53.png
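For anyone hitting the same thing, a quick way to tell whether the data or the console is at fault is to look at the raw bytes. This is a small sketch with a hypothetical sample value standing in for one scraped ic50 string. The console code page CP850 is an assumption on my part (it is a common default on Western European Windows installs), and the last step depends on the local iconv build supporting CP850.

```php
<?php
// Hypothetical sample: ">40 µM" with µ encoded as UTF-8 (bytes C2 B5).
$value = ">40 \xC2\xB5M";

// Raw bytes, independent of how the terminal renders them.
$hex = bin2hex($value);

// preg_match() with the /u modifier only succeeds on valid UTF-8 subjects.
$isUtf8 = preg_match('//u', $value) === 1;

var_dump($hex);    // string(14) "3e343020c2b54d"
var_dump($isUtf8); // bool(true)

// Assumption: the console decodes output as CP850. Reinterpreting the same
// bytes that way reproduces the garbled form: C2 maps to ┬ and B5 to Á.
echo iconv('CP850', 'UTF-8', $value), "\n";
```

If the bytes are valid UTF-8 like this, the scraped data is fine and only the terminal's rendering is off. On Windows, running chcp 65001 in the console before the script switches it to UTF-8, though how well that displays depends on the console font.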