Tags: html, css, ruby-on-rails, web-scraping

Extracting background-images from a web page / Parsing HTML+CSS


I'm building a link-sharing site with Ruby on Rails that lets users share links to web pages.

I would like to extract some representative images for each page (as on Facebook when you share a link).

For now, I first use the opengraph gem to read the og:image meta tag, and then use Nokogiri to parse the page content and collect the src attributes of all <img> tags. This gives good results (apart from some decorative images, which I filter out by size...).

--

Now I would like to go further and parse the CSS background-image property: website logos are often displayed as the background of an <h1> or <a> tag.

I'm thinking of the following process:

  • Scan the HTML document with a regex (something like /background(-image)?:.../) to find inline CSS

  • Retrieve the stylesheet URLs with Nokogiri and scan those stylesheets with the same regex

... and then make the extracted URLs absolute, relative to the URL of the document they came from.
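The process above could be sketched roughly as follows, using only Ruby's standard library. The helper names `absolutify` and `background_image_urls` are hypothetical, and the regex is an approximation, not a real CSS parser: it only picks up the first url(...) in each background declaration and will miss unusual syntax.

```ruby
require "uri"

# Rough pattern: a background or background-image declaration, then the first
# url(...) before the declaration ends. Quotes around the URL are optional.
BACKGROUND_URL = /background(?:-image)?\s*:[^;}]*?url\(\s*["']?([^"')]+?)["']?\s*\)/i

# Hypothetical helper: resolve `url` against `base_url`; nil if it cannot be parsed.
def absolutify(url, base_url)
  URI.join(base_url, url).to_s
rescue URI::Error
  nil
end

# Hypothetical helper: absolute background image URLs found in `text`, which
# may be a stylesheet or an HTML document containing inline styles.
def background_image_urls(text, base_url)
  text.scan(BACKGROUND_URL).flatten
      .reject { |url| url.start_with?("data:") } # skip embedded data URIs
      .map { |url| absolutify(url, base_url) }
      .compact
      .uniq
end
```

The `base_url` should be the URL of the document the text came from: the page URL for inline styles, the stylesheet URL for rules found in a stylesheet, since relative paths in CSS resolve against the stylesheet.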

--

My questions are:

  • Do you think there is a better alternative?

  • Is there a library that would make this process faster?

    For example, if I could build a consolidated view of the HTML and CSS that let me access CSS properties through the DOM, I could read the background images of only a few pre-selected HTML elements (h1, a, ...) and so limit the number of results.


Solution

  • When you parse the CSS of a website, any images you get back will be related to the user interface (sprites, backgrounds), not to the actual content of the page.

    I don't think it would be worth your while unless you're just trying to extract logos. In that case I would restrict to matches on class names/ids/paths containing the word "logo".

    If you want to extract "representative images" from a page, I would just parse the image tags as you are doing, then generate (and crop) a screenshot of the page, as described in: How do I take screenshots of web pages using ruby and a unix server?

    How are you handling images that aren't in the raw HTML source?

    In terms of libraries, I'm pretty sure Nokogiri is the best thing out there.
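
If you do go after logos, the "logo" restriction suggested above might look something like this. It is only a sketch in plain Ruby (no gems); `logo_candidates` is a made-up name, and the rule-splitting regex is deliberately naive, so it ignores comments and does not treat nested blocks such as @media specially.

```ruby
# Naive split of a stylesheet into [selector, declarations] pairs.
CSS_RULE = /([^{}]+)\{([^{}]*)\}/

# First url(...) inside a background or background-image declaration.
BACKGROUND_URL_IN_RULE = /background(?:-image)?\s*:[^;}]*?url\(\s*["']?([^"')]+?)["']?\s*\)/i

# Hypothetical helper: background image URLs whose selector (class name or id)
# or whose own path contains the word "logo".
def logo_candidates(css)
  css.scan(CSS_RULE).flat_map do |selector, declarations|
    declarations.scan(BACKGROUND_URL_IN_RULE).flatten.select do |url|
      selector.match?(/logo/i) || url.match?(/logo/i)
    end
  end.uniq
end
```

The URLs come back exactly as written in the stylesheet, so they still need to be made absolute against the stylesheet's URL before fetching.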