
How to scrape logos from websites?


First off, this is not a question about how to scrape websites. I am fully aware of the tools available to me (css_parser, Nokogiri, etc.); I'm using Ruby to do the scraping.

This is more of an overarching question on the best possible solution to scrape the logo of a website starting with nothing but a website address.

The two solutions I've begun to create are these:

  1. Use Google AJAX APIs to do an image search that is scoped to the site in question, with the query "logo", and grab the first result. This gets the logo, I'd say, about 30% of the time.
  2. The problem with the above is that Google doesn't really seem to care about CSS image-replaced logos (i.e., H1 text that is replaced with the logo image via CSS). The solution I've tentatively come up with is to pull down all CSS files, scan for url() declarations, and then look for the words "header" or "logo" in the file names.

Solution two is problematic because of the many idiosyncrasies of the people who write CSS for websites. Some use "header" instead of "logo" in the file name. Sometimes the file name is random and says nothing about a logo. Other times, the match is simply the wrong image.
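
For reference, the CSS-scanning idea in solution two could be sketched roughly as follows. This is only a minimal illustration using Ruby's standard library: the method names (extract_css_image_urls, logo_score, rank_logo_candidates), the keyword weights, and the example.com URLs are all made up for this sketch, and fetching the stylesheets is assumed to happen elsewhere. It has the same weaknesses described above, since it only looks at file names.

```ruby
require "uri"

# Matches url(...) declarations, with optional single or double quotes.
CSS_URL_PATTERN = /url\(\s*(['"]?)(.*?)\1\s*\)/i

# Keywords to look for in file names. The weights are an assumption made
# for this sketch ("logo" is treated as a stronger hint than "header").
KEYWORD_SCORES = { "logo" => 2, "header" => 1 }.freeze

# Returns the absolute URL of every url() reference in a stylesheet.
# css_text is the stylesheet body; css_url is where it was fetched from,
# which is needed to resolve relative references.
def extract_css_image_urls(css_text, css_url)
  css_text.scan(CSS_URL_PATTERN).filter_map do |_quote, ref|
    next if ref.empty? || ref.start_with?("data:")
    begin
      URI.join(css_url, ref).to_s
    rescue URI::Error
      nil # skip references that are not valid URIs
    end
  end.uniq
end

# Scores a URL by the keywords found in its file name; 0 means no match.
def logo_score(url)
  file_name = File.basename(URI.parse(url).path.to_s).downcase
  KEYWORD_SCORES.sum { |keyword, score| file_name.include?(keyword) ? score : 0 }
end

# Keeps only URLs whose file name matched a keyword, best score first.
def rank_logo_candidates(urls)
  urls.map { |url| [url, logo_score(url)] }
      .select { |_url, score| score.positive? }
      .sort_by { |url, score| [-score, url] }
      .map(&:first)
end

css = <<~'CSS'
  h1#brand { background: url("../img/Site-Logo.png") no-repeat; }
  #top { background-image: url(/img/header_bg.gif); }
  body { background: url('img/tile.png'); }
CSS

urls = extract_css_image_urls(css, "https://example.com/css/site.css")
puts rank_logo_candidates(urls)
```

Resolving each reference against the stylesheet's own URL matters because url() paths are relative to the CSS file, not to the page that includes it.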

I realize I might be able to do something with some sort of machine learning, but I'm on a bit of a deadline for a client and need something fairly capable soon.

So with all that said, if anyone has any "out of the box" thinking on this one, I'd love to hear it. If I can create a solution that works well enough, I plan on open-sourcing the library for any other interested parties :)

Thanks!


Solution

  • Creating an application will definitely help you, but I believe in the end there will be some manual work involved. Here's what I would do.

    • Have your application store in a database a link to every image on a website that is larger than a specified dimension, so that you can weed out small icons.
    • Then you can set up a form to access these results. You may want to set up the database table to store the website URL and the relationship between the URL and the image links.

    Even if it were possible to write an application that truly figures out whether an image is a logo, it seems like it would take a massive amount of code. In the end, it would probably weed out even more than the above, but you have to take into account that it could be faster for a human to visually parse the results than for you to write and test the complex code.
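
As a rough illustration of the filtering and storage steps above, here is a minimal Ruby sketch. It makes several assumptions: the ImageCandidate struct, the method names, and the 64-pixel thresholds are hypothetical, the image dimensions are assumed to have been measured already (for example after downloading each file), and a plain Hash stands in for the database table that maps a website URL to its image links.

```ruby
# One discovered image. Width and height are assumed to be known already;
# measuring them is not shown in this sketch.
ImageCandidate = Struct.new(:site_url, :image_url, :width, :height, keyword_init: true)

# Hypothetical thresholds in pixels, used to weed out small icons.
MIN_WIDTH = 64
MIN_HEIGHT = 64

# True when the image is larger than the thresholds in both directions.
def larger_than_icon?(image, min_width: MIN_WIDTH, min_height: MIN_HEIGHT)
  image.width > min_width && image.height > min_height
end

# Builds the website URL => image links relationship for the images that
# pass the size filter. A Hash stands in for the database table here.
def candidates_by_site(images, min_width: MIN_WIDTH, min_height: MIN_HEIGHT)
  images.select { |image| larger_than_icon?(image, min_width: min_width, min_height: min_height) }
        .group_by(&:site_url)
        .transform_values { |group| group.map(&:image_url) }
end

images = [
  ImageCandidate.new(site_url: "https://example.com",
                     image_url: "https://example.com/img/logo.png",
                     width: 240, height: 80),
  ImageCandidate.new(site_url: "https://example.com",
                     image_url: "https://example.com/img/rss.png",
                     width: 16, height: 16)
]

candidates_by_site(images).each do |site, links|
  puts "#{site}: #{links.join(', ')}"
end
```

The resulting per-site lists are what a review form would display for a human to pick from. Note that a fixed minimum height could also reject wide, short logos, so the thresholds would need tuning.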