During the last few weeks, I had the opportunity to read two documents: the MPEG-4 standard and the H.264 standard.
After reading about all the cool ideas in "MPEG-4", like identifying facial expressions, the motion of people's limbs, and sprites, I got really excited. The ideas sound very fun, maybe even fantastical, for a standard from 1999.
But then I read the "H.264" standard, and none of those ideas were there. There was a lot of discussion of how to encode pixels, but none of the really cool ideas.
What happened? Why were these ideas removed?
This is not a code question, but as a programmer I feel I should try to understand as much as I can of the intent behind a specification. If the code I write adheres to the spirit in which the specification was meant to be used, it is more likely to be positioned to take advantage of the entire specification.
You seem to be assuming that the MPEG-4 Part 10 specification improves on MPEG-4 Part 2, but in fact these two specifications are unrelated, have nothing in common, and were even developed by different groups: MPEG developed the Part 2 specification, while Part 10 was developed jointly by ITU-T VCEG and ISO/IEC MPEG, working together as the Joint Video Team.
Keep in mind that the ISO/IEC 14496 standard is a collection of specifications that apply to different aspects of audiovisual encoding. The goal of the Part 2 specification is to encode different kinds of visual objects (video, 3D objects, etc.). The goal of Part 10 is to provide very efficient, high-quality encoding of video. Other parts of the standard deal with other aspects; for example, the Part 3 specification deals with audio encoding, and Parts 12 and 15 define a container file format that is most typically used to wrap Part 10 video (i.e. H.264) and Part 3 audio (i.e. AAC) into a single file, the so-called .mp4 format.
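To make the container/codec split concrete: the Part 12 container is just a sequence of "boxes", each with a 4-byte big-endian size and a 4-byte type code, and the H.264 and AAC streams live inside those boxes. Here is a minimal sketch (not production code; it ignores 64-bit and run-to-end box sizes) that walks the top-level boxes of some MP4 bytes:

```python
import struct

def parse_boxes(data):
    """List the top-level ISO BMFF boxes as (type, size) tuples.

    Each box starts with a 4-byte big-endian size (which includes the
    8-byte header itself) followed by a 4-character type code.
    """
    boxes = []
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        if size < 8:
            # size values 0 and 1 mean "to end of file" / 64-bit size;
            # this sketch does not handle them.
            break
        boxes.append((box_type.decode("ascii"), size))
        offset += size
    return boxes

# A tiny hand-built example: an 'ftyp' box declaring the 'isom' brand
# (the usual first box of an .mp4 file), followed by an empty 'free' box.
sample = (
    struct.pack(">I4s", 16, b"ftyp") + b"isom" + struct.pack(">I", 0x200)
    + struct.pack(">I4s", 8, b"free")
)
print(parse_boxes(sample))  # [('ftyp', 16), ('free', 8)]
```

Running this on a real .mp4 file's bytes would typically show `ftyp`, `moov` (the Part 12 metadata, including the Part 15 description of the H.264 stream), and `mdat` (the actual coded media) boxes.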
I hope this helps!