Have you ever noticed that FB scrapes the link you post on Facebook (status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail of the image, several images from the linked page, or a video thumbnail for a video link (like YouTube)?
Any ideas how one would copy this function? I'm thinking about a couple of Gearman workers or, even better, just JavaScript that does XHR requests and parses the content with regexes or something similar... Any ideas? Any links? Has someone already tried to do the same and wrapped it in a nice class? Anything? :)
Thanks!
FB scrapes the meta tags from the HTML.
I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.
As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.
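A minimal sketch of that kind of size-based filtering, in case it helps. The 50px thresholds and the function names are my own assumptions for illustration, not anything FB documents. It relies on PHP's built-in getimagesize(), which needs allow_url_fopen enabled to read remote images.

```php
<?php
// Assumed thresholds; the values FB actually uses are not known here.
const MIN_WIDTH  = 50;
const MIN_HEIGHT = 50;

// Hypothetical helper: true if the image is large enough to be a thumbnail.
function isThumbnailCandidate($width, $height, $minWidth = MIN_WIDTH, $minHeight = MIN_HEIGHT)
{
    return $width >= $minWidth && $height >= $minHeight;
}

// Hypothetical helper: keep only the image URLs that pass the size check.
function filterThumbnailCandidates(array $imageUrls)
{
    $candidates = array();
    foreach ($imageUrls as $imageUrl) {
        // getimagesize() returns array(width, height, ...) or false on failure.
        $size = @getimagesize($imageUrl);
        if ($size !== false && isThumbnailCandidate($size[0], $size[1])) {
            $candidates[] = $imageUrl;
        }
    }
    return $candidates;
}
```

With these thresholds a 1px spacer or a small button graphic is dropped, while a regular content image is kept. Fetching every image just to measure it is slow, so in practice you would probably cap the number of images checked.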
Edit: I don't know exactly what you're looking for, but here's a function in PHP for scraping the relevant data from pages.
This uses the simple HTML DOM library from http://simplehtmldom.sourceforge.net/
I've had a look at how FB does it, and it looks like the scraping is done on the server side.
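    // Requires the Simple HTML DOM library, which provides file_get_html().
    require_once 'simple_html_dom.php';

    class ScrapedInfo
    {
        public $url;
        public $title;
        public $description;
        public $imageUrls;
    }

    function scrapeUrl($url)
    {
        $info = new ScrapedInfo();
        $info->url = $url;

        $html = file_get_html($info->url);
        if (!$html) {
            // The page could not be fetched or parsed
            return $info;
        }

        // Grab the page title (not every page has one)
        $titleNode = $html->find('title', 0);
        if ($titleNode) {
            $info->title = trim($titleNode->plaintext);
        }

        // Grab the page description
        foreach ($html->find('meta') as $meta) {
            if ($meta->name == "description") {
                $info->description = trim($meta->content);
            }
        }

        // Grab the image URLs
        $imgArr = array();
        foreach ($html->find('img') as $element) {
            $rawUrl = $element->src;

            // Turn any relative URLs into absolutes.
            // Note: plain concatenation is only correct when $url ends with "/"
            // and $rawUrl is a path relative to that directory.
            if (substr($rawUrl, 0, 4) != "http") {
                $imgArr[] = $url . $rawUrl;
            } else {
                $imgArr[] = $rawUrl;
            }
        }
        $info->imageUrls = $imgArr;

        return $info;
    }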
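The image loop in scrapeUrl() turns relative paths into absolute ones by appending them to the page URL, which goes wrong for root-relative paths ("/img/a.png"), protocol-relative ones ("//cdn.example.com/a.png"), and pages whose URL ends in a file name. Below is a rough sketch of a more careful resolver. resolveUrl() is a hypothetical helper of mine, not part of Simple HTML DOM, and it ignores edge cases such as data: URIs, a <base> tag, or query strings that contain slashes.

```php
<?php
// Hypothetical helper: resolve an <img> src against the URL of the page
// it was found on. A sketch only, not a full RFC 3986 implementation.
function resolveUrl($base, $relative)
{
    // Already absolute (http:// or https://): nothing to do.
    if (preg_match('#^https?://#i', $relative)) {
        return $relative;
    }

    $parts  = parse_url($base);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host']) ? $parts['host'] : '';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $origin = $scheme . '://' . $host . $port;

    // Protocol-relative, e.g. //cdn.example.com/a.png
    if (substr($relative, 0, 2) === '//') {
        return $scheme . ':' . $relative;
    }

    // Root-relative, e.g. /img/a.png
    if (substr($relative, 0, 1) === '/') {
        return $origin . $relative;
    }

    // Document-relative: start from the directory part of the base path.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    // Collapse "." and ".." segments.
    $segments = array();
    foreach (explode('/', $dir . $relative) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }

    return $origin . '/' . implode('/', $segments);
}
```

Inside the image loop you could then write $imgArr[] = resolveUrl($url, $rawUrl); instead of the if/else on the "http" prefix.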