
Google Scholar Server Error HTML Parser


Until this week I was able to use a simple HTML DOM parser to scrape content from Google Scholar. (Yes, I'm aware they don't want people doing that, hence no API.)

But in the past day or two it has stopped returning content. A simple file_get_html call on a URL now produces this error:

Server Error. We're sorry but it appears that there has been an internal server error while processing your request. Our engineers have been notified and are working to resolve the issue. Please try again later.

I've seen other questions about this, but the solutions are mostly R-specific or use cURL. Does anyone have suggestions for tweaking my simple PHP function, in particular to make the request twice? Or am I out of luck because Google is now closing this door?

My code:

<?php require_once('assets/functions/simple_html_dom.php');
// Scholar profile ID stored as WordPress post meta
$google_id = get_post_meta($post->ID, 'ecpt_google_id', true);
$google_url = 'http://scholar.google.com/citations?user=' . $google_id . '&pagesize=10';
$older_pubs = 'http://scholar.google.com/citations?user=' . $google_id;
// file_get_html() builds the DOM object itself, so no separate "new simple_html_dom" is needed
$google = file_get_html($google_url);

foreach($google->find('tr.gsc_a_tr') as $article) {
    $item['title']  = $article->find('td.gsc_a_t a', 0)->plaintext;
    $item['link']   = $article->find('a.gsc_a_at', 0)->href;
    $item['pub']    = $article->find('td.gsc_a_t .gs_gray', 1)->plaintext;
    $item['year']   = $article->find('td.gsc_a_y', 0)->plaintext;

    ?>
    <p class="pub"><b><a href="http://scholar.google.com<?php echo $item['link'];?>"><?php echo $item['title']; ?></a></b></p>
    <h6 class="pub"><?php echo $item['year']; ?>, <?php echo $item['pub']; ?></h6>


    <?php } ?>
<p align="right"><b><a href="<?php echo $older_pubs; ?>">View Publications</a></b></p>

Solution

  • Google Scholar can no longer be accessed without accepting cookies. A "server error" occurs if you try to access it with curl, wget, or similar tools.

    Try accepting cookies; for curl/PHP see: Google Server gives a server error with the first request in private browsing mode

    Then load the page twice: the first request receives the cookie along with the server error, and the second returns the content.
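
The two-request approach described above could be sketched roughly as follows. This is a minimal illustration, not a verified fix: it assumes the PHP cURL extension is available and that a cookie set by the first response is all the second request needs. The function name fetch_twice_with_cookies is hypothetical and is not part of simple_html_dom.

```php
<?php
// Hypothetical helper (an assumption for illustration, not a library function):
// request the same URL twice on one cURL handle so that any cookies set by the
// first response are sent back with the second request.
function fetch_twice_with_cookies(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_COOKIEFILE     => '',   // empty string enables the in-memory cookie engine
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 15,
    ]);

    // First request: its body is discarded; only the cookies it sets are kept.
    curl_exec($ch);

    // Second request: reuses the handle, so the stored cookies are sent along.
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Return the HTML only when the second request succeeded with HTTP 200.
    return ($body !== false && $status === 200) ? $body : null;
}
```

If the function returns a string, it can be handed to the parser with $google = str_get_html($html); in place of the file_get_html($google_url) call, and the existing foreach loop stays unchanged. A null return means the second request also failed, so the template should skip the publication list in that case.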