Extracting background-images from a web page / Parsing HTML+CSS

I'm building a link-sharing site with Ruby on Rails that lets users share links to web pages.

I would like to extract some representative images for each page (as on Facebook when you share a link).

For now I use the opengraph gem to read the og:image meta tag first, and then I use Nokogiri to parse the page content and collect the src attribute of every <img> tag. This gives good results (apart from some decorative images, which I filter out by size).

Now I would like to go further and parse the CSS background-image property: website logos are often displayed as the background of an <h1> or <a> tag.

I am thinking of the following process:

Parse the HTML document with a regex (something like /background(-image)?:.../) to find inline CSS
Retrieve the stylesheet URLs with Nokogiri and parse those sheets with the same regex

... and then make each URL absolute, relative to the URL of the document it was found in.
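To make the process above concrete, here is a minimal sketch using only Ruby's standard library. The pattern BACKGROUND_URL and the method background_image_urls are hypothetical names made up for illustration, and the regex is an assumption about what a workable pattern could look like; it will not cope with every CSS edge case (comments, escaped characters, multiple backgrounds in one declaration). The base URL is assumed to be the stylesheet's URL, or the page's URL for inline CSS.

```ruby
require "uri"

# Hypothetical pattern: captures the url(...) argument of a `background`
# or `background-image` declaration. It does not match background-color etc.
BACKGROUND_URL = /background(?:-image)?\s*:[^;}]*?url\(\s*['"]?([^'")]+?)['"]?\s*\)/i

# Returns the absolute image URLs found in `css`, resolved against `base_url`.
def background_image_urls(css, base_url)
  urls = css.scan(BACKGROUND_URL).flatten.map do |ref|
    next if ref.start_with?("data:") # skip inline data URIs
    begin
      URI.join(base_url, ref.strip).to_s
    rescue URI::Error
      nil # ignore references that cannot be parsed as URIs
    end
  end
  urls.compact.uniq
end
```

Called with the text of a stylesheet and that stylesheet's URL, it returns an array of absolute URLs; a relative reference such as ../img/logo.png is resolved against the stylesheet's location, not the page's.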

My questions are:

Do you think there is a better alternative?
Is there a library of some sort that can increase the performance of the process?

For example, if I could build a consolidated view of the HTML and CSS that lets me access CSS properties through the DOM, I could read the background images of only a few pre-selected HTML elements (h1, a, ...) and limit the number of results.

Solution

When you parse the CSS of a website, any images you get back are going to be related to the user interface (sprites, backgrounds), not the actual content of the page.

I don't think it would be worth your while unless you're just trying to extract logos. In that case I would restrict the matches to class names, IDs, or paths containing the word "logo".
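As a rough sketch of that "logo" filter, assuming you already have the stylesheet text in a string: the constants RULE and IMAGE_URL and the method logo_candidates are hypothetical names for illustration, and the rule splitting is deliberately naive (it ignores comments and quoted braces, and it looks at every url(...) in a rule, not only background declarations).

```ruby
# Naive split of CSS into (selector, declarations) pairs.
RULE = /([^{}]+)\{([^{}]*)\}/
# Captures the argument of any url(...) in a declaration block.
IMAGE_URL = /url\(\s*['"]?([^'")]+?)['"]?\s*\)/i

# Returns url(...) references whose selector or path mentions "logo".
# The references are returned as written in the CSS (not made absolute).
def logo_candidates(css)
  css.scan(RULE).flat_map do |selector, declarations|
    declarations.scan(IMAGE_URL).flatten.select do |ref|
      selector.match?(/logo/i) || ref.match?(/logo/i)
    end
  end.uniq
end
```

A sprite referenced from a rule like #site-logo a would be kept because of its selector, while a hero background with no "logo" in either the selector or the path would be dropped.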

If you want to extract "representative images" from a page, I would just parse the image tags as you are doing, then generate (and crop) a screenshot of the page, as described in: How do I take screenshots of web pages using ruby and a unix server?

How are you handling images that aren't in the raw HTML source?

In terms of libraries, I'm pretty sure Nokogiri is the best thing out there.