I'm building a sharing site with Ruby on Rails that lets users share links to webpages.
I would like to extract some representative images for each page (as on Facebook when you share a link).
For now I use the opengraph gem to parse the og:image meta tag first, and then I use Nokogiri to parse the page content and collect the src attributes of all <img> tags. This gives good results (apart from some decoration images, so I filter the results by size...).
--
Now I would like to go further and parse the CSS background-image property: website logos are often displayed as the background of an <h1> or an <a> tag.
I'm thinking about the following process (roughly sketched in the snippet below):
Parse the HTML document with a regex (something like /background(-image)?:.../) to find inline CSS
Retrieve the CSS stylesheet URLs with Nokogiri and parse those sheets with the same regex
... and absolutify the URLs against the document's URL.
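A rough sketch of that process, assuming a deliberately naive regex and open-uri for fetching; it will miss data URIs, multiple backgrounds, @import rules, and anything nested inside @media blocks:

```ruby
require 'open-uri'
require 'uri'
require 'nokogiri'

# Naive pattern: catches background / background-image declarations
# that use url(...), nothing more.
BG_URL = /background(?:-image)?\s*:[^;"}]*url\(\s*['"]?([^'")\s]+)['"]?\s*\)/i

def background_image_urls(page_url)
  html = URI.open(page_url).read
  doc  = Nokogiri::HTML(html)
  urls = []

  # 1. Inline CSS: style attributes and <style> blocks in the document itself.
  urls += html.scan(BG_URL).flatten

  # 2. External stylesheets referenced by <link rel="stylesheet">.
  doc.css('link[rel="stylesheet"]').each do |link|
    next unless link['href']
    sheet_url = URI.join(page_url, link['href']).to_s
    begin
      css = URI.open(sheet_url).read
    rescue StandardError
      next # skip stylesheets that cannot be fetched
    end
    # Absolutify relative to the stylesheet itself, not the page.
    urls += css.scan(BG_URL).flatten.map { |u| URI.join(sheet_url, u).to_s }
  end

  # 3. Absolutify whatever came from the page itself against the page URL.
  urls.map { |u| u =~ %r{\Ahttps?://} ? u : URI.join(page_url, u).to_s }.uniq
end
```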
--
My questions are:
Do you think there is a better alternative?
Is there a library of some sort that could make this process faster?
For example, if I could build a consolidated view of the HTML and CSS that lets me access CSS properties through the DOM, I could look only at the background-image of pre-selected HTML elements (h1, a, ...) and limit the number of results.
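As a very limited approximation of that idea, one could inspect only the inline style attribute of a whitelist of elements; anything set from an external stylesheet is missed, and the selector list here is just an example:

```ruby
require 'open-uri'
require 'nokogiri'

# Only elements matching these selectors are considered (illustrative choice).
CANDIDATE_SELECTORS = 'h1, a'.freeze
INLINE_BG = /background(?:-image)?\s*:[^;]*url\(\s*['"]?([^'")\s]+)['"]?\s*\)/i

def inline_background_images(url)
  doc = Nokogiri::HTML(URI.open(url))
  doc.css(CANDIDATE_SELECTORS)
     .map { |node| node['style'] }     # inline style attribute, if any
     .compact
     .map { |style| style[INLINE_BG, 1] }
     .compact
end
```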
--
When you parse the CSS of a website, any images you get back will be related to the user interface (sprites, backgrounds), not the actual content of the page.
I don't think it would be worth your while unless you're only trying to extract logos. In that case I would restrict the matches to class names, ids, or paths containing the word "logo".
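A sketch of that heuristic, scanning raw CSS text for rules whose selector mentions "logo"; the rule-splitting regex is naive and ignores @media nesting:

```ruby
# Extract url(...) values only from CSS rules whose selector mentions "logo".
def logo_urls_from_css(css)
  css.scan(/([^{}]+)\{([^}]*)\}/m).flat_map do |selector, body|
    next [] unless selector =~ /logo/i
    body.scan(/url\(\s*['"]?([^'")\s]+)['"]?\s*\)/i).flatten
  end
end
```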
If you want to extract "representative images" from a page, I would parse the image tags as you are already doing, then generate (and crop) a screenshot of the page as described in: How do I take screenshots of web pages using ruby and a unix server?
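One possible way to implement that fallback, assuming wkhtmltoimage is installed on the server and MiniMagick/ImageMagick are available for the crop; the tool choice and crop geometry are illustrative, not part of the linked answer:

```ruby
require 'mini_magick'

def screenshot_thumbnail(url, output = 'thumb.png')
  full = 'page_full.png'
  # Render the page to an image with the wkhtmltoimage CLI.
  system('wkhtmltoimage', '--quality', '80', url, full) or
    raise "wkhtmltoimage failed for #{url}"

  # Keep the top-left 600x400 region of the rendered page.
  image = MiniMagick::Image.open(full)
  image.crop('600x400+0+0')
  image.write(output)
  output
end
```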
How are you handling images that aren't in the raw HTML source?
In terms of libraries, I'm pretty sure nokogiri is the best thing out there.