According to Wikipedia:
[Ascii85 uses] the ASCII characters 33 (!) through 117 (u) inclusive (to represent the base-85 digits 0 through 84), together with the letter z (as a special case to represent a 32-bit 0 value).
[btoa] Version 4.2 added a "y" exception for a group of all ASCII space characters
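The two exceptions quoted above can be seen with Python's standard `base64` module, used here purely as an illustration. `a85encode` always folds an all-zero group into `z`, and its `foldspaces` flag enables the btoa-style `y` shortcut:

```python
import base64

zeros = b"\x00\x00\x00\x00"   # one 32-bit zero group
spaces = b"    "              # four ASCII spaces, 0x20202020

print(base64.a85encode(zeros))                     # b'z'
print(base64.a85encode(spaces))                    # b'+<VdL' (no shortcut by default)
print(base64.a85encode(spaces, foldspaces=True))   # b'y'
```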
While zero data might be quite common, using `z` to compress zeros seems like an arbitrary optimization that won't always be of use.
Likewise, the less frequently supported `y` only helps if the raw bytes contain four adjacent ASCII spaces. In UTF-16LE a space is actually encoded as `20 00`, so `0x20202020` isn't all that common in such Unicode texts.
Binary data does often have adjacent `00`s, but it also often contains adjacent `FF`s.
Text data does often contain adjacent spaces, but it also often contains adjacent tab characters, or adjacent new-line characters.
It would seem that a frequency analysis, and usage of 9 or 10 characters (ASCII chars 118-126/127, or `v` through `~`/DEL) to represent the 9/10 most frequent 32-bit values, might lead to better compression.
The mapping of compression character to 32-bit value could perhaps sit at the start of the encoded string, enclosed between `<[` and `]>`. For 32-bit values that are 4 repeated bytes, the 32-bit value can be abbreviated to the repeated hex value(s).
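A minimal sketch of the frequency analysis and mapping header described here, in Python. The function names `top_groups` and `header` and the `;` separator are assumptions made for illustration; nothing like this is part of any Ascii85 specification:

```python
from collections import Counter

def top_groups(data, limit=9):
    """Return the most frequent 4-byte groups, most common first."""
    usable = len(data) - len(data) % 4   # ignore a trailing partial group
    groups = [data[i:i + 4] for i in range(0, usable, 4)]
    return [g for g, _ in Counter(groups).most_common(limit)]

def header(groups):
    """Build the <[...]> mapping, abbreviating groups of 4 repeated bytes."""
    def entry(g):
        return f"{g[0]:02X}" if len(set(g)) == 1 else g.hex().upper()
    return "<[" + ";".join(entry(g) for g in groups) + "]>"

sample = bytes.fromhex("00000000 FFFFFFFF 20202020 0D000A00") * 2
print(header(top_groups(sample)))   # <[00;FF;20;0D000A00]>
```

Groups with equal counts keep the order in which they first appear, so the header is deterministic for a given input.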
For example:
The binary data (192 bytes):
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
00 00 00 00 FF FF FF FF 20 20 20 20 2D 2D 2D 2D 09 09 09 09 0D 00 0A 00
Note the presence of spaces (`20`), hyphens (`2D`), tabs (`09`) and UTF-16LE carriage return/line feed pairs (`0D 00 0A 00`).
This could be encoded as (79 bytes):
<[00;FF;20;2D;09;0D000A00]><~vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|vxyz{|~>
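To check the size claim, here is a rough Python sketch of the proposed scheme. Everything in it is hypothetical: `encode_with_table` is an invented name, short characters are assigned consecutively from `v` in order of frequency (so the body comes out as `vwxyz{` repeated, a different choice of characters than in the string above, but the same length), and only inputs that are a whole number of 4-byte groups are handled:

```python
from collections import Counter

def encode_group(group):
    """Plain Ascii85: one 4-byte group becomes 5 characters in '!'..'u'."""
    value = int.from_bytes(group, "big")
    digits = []
    for _ in range(5):
        value, digit = divmod(value, 85)
        digits.append(chr(33 + digit))
    return "".join(reversed(digits))

def encode_with_table(data, limit=9):
    assert len(data) % 4 == 0, "sketch only handles whole 4-byte groups"
    groups = [data[i:i + 4] for i in range(0, len(data), 4)]
    frequent = [g for g, _ in Counter(groups).most_common(limit)]
    # Hypothetical assignment: 'v' (118) upward, most frequent group first.
    short = {g: chr(118 + i) for i, g in enumerate(frequent)}
    table = ";".join(
        f"{g[0]:02X}" if len(set(g)) == 1 else g.hex().upper()
        for g in frequent
    )
    body = "".join(short.get(g) or encode_group(g) for g in groups)
    return f"<[{table}]><~{body}~>"

row = bytes.fromhex("00000000 FFFFFFFF 20202020 2D2D2D2D 09090909 0D000A00")
encoded = encode_with_table(row * 8)
print(len(row * 8), len(encoded))   # 192 79
```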
Is there merit in an encoding approach that uses such compression? Why aren't the various Ascii85 specs more aggressive with compression?
Because you would normally run the data through a compression program before encoding it with Ascii85, and a real compressor can do a much better job than the suggested ad hoc encoding.
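As a rough illustration of that order of operations, the sketch below compresses the 192-byte sample from the question with `zlib` before Ascii85-encoding it. The choice of `zlib` is an assumption made for the example; any general-purpose compressor would serve:

```python
import base64
import zlib

row = bytes.fromhex("00000000 FFFFFFFF 20202020 2D2D2D2D 09090909 0D000A00")
data = row * 8   # the 192-byte sample from the question

# Ascii85 alone, with the <~ ... ~> framing.
plain = base64.a85encode(data, adobe=True)

# Compress first, then Ascii85-encode the compressed bytes.
packed = base64.a85encode(zlib.compress(data, 9), adobe=True)

print(len(data), len(plain), len(packed))

# Decoding reverses the two steps.
assert zlib.decompress(base64.a85decode(packed, adobe=True)) == data
```

The compressor finds the repetition itself, so no table of frequent values has to be shipped in a custom header.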