Base64 canonical encodings

I'm trying to test that encode and decode functions (defined in Data.ByteString.Base64.Lazy) are inverse:

import qualified Data.ByteString.Lazy as BL

encoded :: Gen BL.ByteString
encoded = do
  body <- concat <$> listOf (group 0)
  end <- group =<< choose (0, 2)
  return . BL.pack $ body <> end
 where
  group :: Int -> Gen [Word8]
  group pad = do
    letters <- vectorOf (4 - pad)
      . elements . map (fromIntegral . ord)
      $ ['A'..'Z'] <> ['a'..'z'] <> ['0'..'9'] <> ['+','/','=']
    return $ letters <> replicate pad 61  -- 61 is ascii for =

prop_encDec = forAll encoded $ \b ->
  [b] == (encode <$> rights [decode b])

But QuickCheck discovers a problem there:

=== prop_encDec ===
*** Failed! Falsifiable (after 1 test):
"1yx="

I have investigated that the problem must be related to non-canonical encodings, but I have difficulties understanding what is it and how to deal with it. Can you please explain on this example why decoding “1yx=” and re-encoding produces “1yw=”.

Solution

The problem is that the x does contain ones at positions that are not relevant for the encoding.

Let me elaborate: The base64 string 1yx= will be decoded to following binary pattern

base64: 1      y      x      =
binary: 000001 110010 110001 000000

however the = at the end of the string tells the encoder that the last 8 bits are not relevant so

000001 110010 110001 000000
^^^^^^ ^^^^^^ ^^^^

only the marked bits will be decoded to

binary: 000001 110010 1100

as you can see the last two bits (01) encoded by the x are ignored

Then if we encode the decoded data the encoder will fill up the stream of bits with zero bits, resulting in

binary: 000001 110010 110000 000000
                          ^^ ^^^^^^
base64: 1      y      w      =

So to conclude: The encoder can't "correctly" re-encode the decoded 1yx= because it contains information that gets lost due to decoding.

For your testing purposes I would suggest to swap the order of operation. So generate some random string as input, encode it and then decode it and compare it to the original input.

You also might want to checkout the example section of the Wikipedia article for Base64 encoding. It contains goods examples regarding the padding of data.

On canonical and non-canonical encoding:

A base64 string is canonical if and only if all padded bits are zero, if one of the padded bits is not zero the string is non-canonical. So if a base64 string has one = at the end, then the last 8 bits must be zero for it to be canonical and if the string has two = at the end then the last 16 bits must be zero.

So the string 1yx= is non-canonical, because as we saw above one of the padded bits is one. On the other hand the string 1yw= is canonical because all 8 padded bits are zero.