rss xml-rpc atom-feed

What's the universal standard to get data from any blog?


I want to extract data from various kinds of blogs and was going through various ways to do it:

  1. An API that requires user authentication
  2. XML-RPC (I don't know which blogs support it)
  3. RSS (again, I'm not sure which blogs support it and, even if they do, how much one can get from RSS feeds)
  4. Atom

I know this isn't strictly a programming question, but I'm asking anyway because there is a lot of confusion about which of these to use and which is best supported.

It would be nice to avoid APIs that require authentication: you not only have to handle each blog's own authentication scheme, you also have to deal with differing API limits.


Solution

  • RSS is the oldest of these to come into use, and it has limitations. Atom was designed as its replacement and overcomes those limitations. Both are XML-based feed formats. XML-RPC is something different: a general remote procedure call protocol that some blog platforms expose as an API, with many uses beyond blogs. Atom is not a variant of it. Loosely speaking, all of the above are a type of API, but for reading posts without authentication the feeds are what you want. So ideally you should support both RSS and Atom. Sadly, Atom is not backwards compatible with RSS, so each needs its own parsing. To quote Wikipedia on "Atom":

    In particular, many blog and wiki sites offer their web feeds in the Atom format.

    @porneL's solution is not recommended (at the moment). However, in the future HTML markup is set to change to give blocks more semantic meaning, for example through the new <article> tag. That will be yet another way to parse documents. It will be the most versatile, but in my opinion it will be a very long time before it becomes reliable, since many if not most sites suffer from 'tag soup' syndrome.
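
To make "support RSS and Atom" concrete, here is a minimal sketch in Python using only the standard library. It assumes the feed is either RSS 2.0 (root element `rss`) or Atom 1.0 (root element `feed` in the `http://www.w3.org/2005/Atom` namespace). The function name `parse_feed` and the sample documents are made up for illustration, and RSS 1.0 (RDF) and malformed real-world feeds are not handled.

```python
import xml.etree.ElementTree as ET

# Atom 1.0 elements live in this XML namespace; RSS 2.0 elements have none.
ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample feeds, for illustration only.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example</title>
<item><title>Hello RSS</title><link>http://example.com/rss-post</link></item>
</channel></rss>"""

SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom"><title>Example</title>
<entry><title>Hello Atom</title><link href="http://example.com/atom-post"/></entry>
</feed>"""


def parse_feed(xml_text):
    """Return a list of {'title', 'link'} dicts from an RSS 2.0 or Atom 1.0 document."""
    root = ET.fromstring(xml_text)
    entries = []
    if root.tag == ATOM_NS + "feed":
        # Atom: each post is an <entry>; the URL is the href attribute of <link>.
        for entry in root.findall(ATOM_NS + "entry"):
            link = ""
            for link_el in entry.findall(ATOM_NS + "link"):
                # A <link> without a rel attribute counts as rel="alternate".
                if link_el.get("rel", "alternate") == "alternate":
                    link = link_el.get("href", "")
                    break
            entries.append({
                "title": entry.findtext(ATOM_NS + "title", default=""),
                "link": link,
            })
    elif root.tag == "rss":
        # RSS 2.0: each post is an <item>; the URL is the text of <link>.
        for item in root.iter("item"):
            entries.append({
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
            })
    else:
        raise ValueError(f"unrecognised feed root element: {root.tag}")
    return entries


if __name__ == "__main__":
    for sample in (SAMPLE_RSS, SAMPLE_ATOM):
        for post in parse_feed(sample):
            print(post["title"], post["link"])
```

For a real blog you would fetch the feed first (for example with `urllib.request.urlopen`) and pass the response body to `parse_feed`. Feeds in the wild are often sloppier than these samples, so a dedicated third-party parser such as `feedparser` may be the more practical choice.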