Tags: facebook, debugging, web-scraping, facebook-opengraph

Scraper fails on files over ~390KB


Does Facebook's URL scraper have a size limitation? We have several books available on a website. Those whose HTML file size is under a certain threshold (~390KB) get scraped and read properly, but the 4 larger ones do not. These larger pages still get a 200 response code, and the canonical URL opens.

All of these pages are built using the same template, the only differences being the size of the content within each book and the number of links each book makes to other pages on the site.

  1. Click on the canonical URL
  2. Open Firebug in Firefox or the developer tools in Chrome and go to the Network tab
  3. Note the *.html size: >~390KB for the listed failures, <~390KB for the successes
  4. Click on "See exactly what our scraper sees for your URL"
  5. Blank page for failures, HTML present for successes

Failures:

Successes:
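
If you want to script the check from the steps above instead of using the browser tools, a rough sketch like the following (assuming the PHP curl extension is available; the URL is a placeholder, not one of the real book pages) fetches a page with the scraper's user agent and reports the size of the HTML that comes back:

    <?php
    // Placeholder URL; substitute one of the failing or succeeding book pages.
    $url = 'https://example.com/books/some-book.html';

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Impersonate the Facebook scraper's documented user agent.
    curl_setopt($ch, CURLOPT_USERAGENT,
        'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)');

    $html = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($html === false) {
        die("Request failed\n");
    }
    // Compare this size against the ~390KB threshold described above.
    printf("HTTP %d, %d bytes (~%.0f KB)\n", $status, strlen($html), strlen($html) / 1024);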


Solution

  • A solution to your problem might be to check whether a real user or the Facebook bot is visiting your page. If it is the bot, render only the metadata it needs. You can detect the bot via its user agent, which according to the Facebook documentation is:
    "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

    The code would look something like this (in PHP):

    function userAgentIsFacebookBot() {
        // Guard against the User-Agent header being absent entirely.
        $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
        // Matching on the product token is more robust than comparing the
        // exact string, in case the "(+http://...)" suffix varies.
        return strpos($userAgent, 'facebookexternalhit') !== false;
    }
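
    For completeness, a hypothetical usage sketch: when the bot is detected, emit only a minimal page containing the Open Graph tags (the og: values and URLs below are placeholders, not your real book data):

    if (userAgentIsFacebookBot()) {
        // Serve a stripped-down page with only the Open Graph metadata,
        // so the scraper never has to download the full ~390KB+ book HTML.
        echo '<!DOCTYPE html><html><head>';
        echo '<meta property="og:title" content="Book title" />';
        echo '<meta property="og:type" content="book" />';
        echo '<meta property="og:url" content="https://example.com/books/some-book.html" />';
        echo '<meta property="og:image" content="https://example.com/images/some-book.jpg" />';
        echo '</head><body></body></html>';
        exit;
    }
    // Otherwise render the full book page as usual.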