I'm currently using JSON (compressed via gzip) in my Java project, in which I need to store a large number of objects (hundreds of millions) on disk. I have one JSON object per line, and disallow linebreaks within the JSON object. This way I can stream the data off disk line-by-line without having to read the entire file at once.
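For reference, the line-by-line streaming described above can be sketched roughly as follows. This is a minimal illustration using only the standard library; the class name `GzipLineStream` and its method names are made up for the example, and the in-memory buffer in `main` stands in for a real file on disk.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, named for this example only.
class GzipLineStream {

    // Write one record per line into a gzip-compressed stream.
    // Records must not contain line breaks themselves.
    static void writeLines(OutputStream out, List<String> records) throws IOException {
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(new GZIPOutputStream(out), StandardCharsets.UTF_8))) {
            for (String record : records) {
                w.write(record);
                w.write('\n');
            }
        }
    }

    // Decompress on the fly and pass each line to the handler,
    // so only one record is held in memory at a time.
    static void forEachLine(InputStream in, Consumer<String> handler) throws IOException {
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(in), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                handler.accept(line); // the JSON parse would happen here
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeLines(buffer, Arrays.asList("{\"id\":1}", "{\"id\":2}"));
        forEachLine(new ByteArrayInputStream(buffer.toByteArray()), System.out::println);
    }
}
```

With a real file, the `ByteArrayInputStream` would be replaced by a `FileInputStream`; the per-line JSON parse inside the handler is the step that dominates the cost.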
It turns out that parsing the JSON (using http://www.json.org/java/) costs more than either pulling the raw data off disk or decompressing it (which I do on the fly).
Ideally what I'd like is a strongly-typed serialization format, where I can specify "this object field is a list of strings" (for example), and because the system knows what to expect, it can deserialize it quickly. I can also specify the format just by giving someone else its "type".
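To illustrate why a known schema can make deserialization cheap, here is a rough hand-rolled sketch. It is not the wire format of any particular library; the `Item` type, its fields, and the class name `TypedRecordSketch` are all hypothetical. Because the reader already knows the layout is "an int, then a list of strings", it reads the fields directly and never has to tokenize text or look up field names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical example, not the format of any real serialization library.
class TypedRecordSketch {

    // The "type": an integer id plus a list of strings.
    static final class Item {
        final int id;
        final List<String> tags;

        Item(int id, List<String> tags) {
            this.id = id;
            this.tags = tags;
        }
    }

    // Fields are written in a fixed order; the list is length-prefixed.
    // Note: writeUTF limits each string to 65535 encoded bytes.
    static void write(DataOutputStream out, Item item) throws IOException {
        out.writeInt(item.id);
        out.writeInt(item.tags.size());
        for (String tag : item.tags) {
            out.writeUTF(tag);
        }
    }

    // The reader relies on the same fixed order, so no parsing of
    // field names or structure is needed.
    static Item read(DataInputStream in) throws IOException {
        int id = in.readInt();
        int count = in.readInt();
        List<String> tags = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            tags.add(in.readUTF());
        }
        return new Item(id, tags);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out, new Item(42, Arrays.asList("red", "green")));
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            Item item = read(in);
            System.out.println(item.id + " " + item.tags);
        }
    }
}
```

A hand-rolled format like this is tied to Java's `DataOutputStream` conventions and has no story for schema changes, so it does not meet the cross-platform requirement on its own; it only shows where the speed of a typed format comes from.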
It would also need to be cross-platform. I use Java, but work with people using PHP, Python, and other languages.
So, to recap, it should be strongly typed, fast to deserialize, and cross-platform.
Any pointers?
Have you looked at Google Protocol Buffers?
http://code.google.com/apis/protocolbuffers/
They're cross-platform (C++, Java, Python), with third-party bindings for PHP as well. They're fast, fairly compact, and strongly typed.
There's also a useful comparison between various formats here:
http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
You might want to consider Thrift or one of the others mentioned here as well.