Tags: algorithm, search, rss, google-reader

How does Google Reader extract news items from a web page?


I was wondering how Google Reader extracts news items from a web page.

Does anyone know how it works, or how someone could build a similar system to extract the same information from the HTML of a web page?

It does not seem to rely on a standard, and it cannot be reading only RSS/Atom, because Google Reader appears to read the content of a page regardless of what the markup looks like.


Solution

  • Google Reader does not currently do any kind of extraction of content from raw web pages. It used to have a "track changes to arbitrary pages" feature, but that was removed more than a year ago.

    When given a URL that is not that of a feed, Google Reader fetches its contents. If the contents are HTML, it looks for an autodiscovery element of the form <link rel="alternate" type="application/atom+xml" href="feed.xml">. If found, it subscribes to the feed.
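
The autodiscovery step described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not Google Reader's actual code: the names `FeedLinkFinder` and `discover_feeds` are made up for this example, and accepting `application/rss+xml` alongside the Atom type is an assumption based on common autodiscovery practice rather than something stated in the answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds. The Atom type comes from the answer above;
# the RSS type is an assumption based on common autodiscovery practice.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects the targets of <link rel="alternate"> feed elements (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        href = attr.get("href", "")
        if "alternate" in rels and mime in FEED_TYPES and href:
            # Resolve relative hrefs such as "feed.xml" against the page URL.
            self.found.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return absolute feed URLs advertised by an HTML page, in document order."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.found
```

Calling `discover_feeds(page_html, page_url)` on a fetched page returns a list of candidate feed URLs; an empty list means the page advertises no feed, in which case a reader built this way would have nothing to subscribe to.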