Given an array of arbitrary length and arbitrary values, write an equation 'E' and find a simplified (compressed) representation 'R' for the randomized output 'O' of the array, such that 'R' fed into 'E' = 'O'.
For example, suppose we were given as input an array of length 10 containing values that correspond to their indices.
If sorted, the array would be Array = [0,1,2,3,4,5,6,7,8,9].
The raw input array would be some random order of these indices. Let's use Array = [9,5,8,2,1,0,6,3,4,7].
Find 'R' such that when 'R' is applied to [0,1,2,3,4,5,6,7,8,9], 'O' = [9,5,8,2,1,0,6,3,4,7].
I am open to nearly any solution to this problem in any language so long as output 'R' meets the following conditions.
#1. Output 'R' is smaller in memory than storing the array of indices literally.
#2. Output 'R' is not simply a directly compressed version of the input run through something like LZ77 or LZSS. Output 'R' must be a novel representation of the random order rather than a derivative of the input.
#3. Output 'R' when compared to the input has an average compression ratio of at least ~2:1.
#4. Output 'R' has a constant fixed size in memory for a given array length.
To elaborate, if 'R' requires 3 bytes of storage to recreate [9,5,8,2,1,0,6,3,4,7], then the expectation is that any random input order of 10 elements can be stored in 'R' using 3 bytes. It is acceptable for the storage size of 'R' to increase linearly with array size. Though bonus points if you find a solution that does not increase in size.
As a starting point, my best guess as to how this would be accomplished is by using a random number generator as 'E' and a seed value as 'R' such that you get output 'O'. The difficulty is that the seed value is the unknown variable, and thus you have to work backwards to find it from the randomized input. I would roughly imagine you would want to perform some sort of operation like a Fisher-Yates shuffle (or equivalent) to reconstruct 'O' from a sorted array, and then figure out the inverse of this operation to go from a randomized input array to some seed value 'R'. I am unaware of a mathematical method to accomplish this other than brute forcing it and checking every seed value until you get a match (which is not a good option). This is why I said I was open to nearly any solution, as there might be a better option than using a random number generator. But if there is, I am unaware of it.
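To make that concrete, here is a rough Python sketch of the seeded-shuffle idea (purely illustrative; the names are made up, and the brute-force seed search is exactly the part I want to avoid):

```python
import random

def expand(n, seed):
    """'E': rebuild an ordering of n elements from a seed via a seeded Fisher-Yates shuffle."""
    rng = random.Random(seed)
    arr = list(range(n))
    rng.shuffle(arr)  # random.shuffle is a Fisher-Yates shuffle internally
    return arr

def find_seed(target, max_seed=10**7):
    """Brute-force search for 'R': try seeds until the shuffle reproduces the target order.

    Even for 10 elements there are 10! = 3,628,800 orderings, so this toy case
    already needs a few million attempts on average; it clearly does not scale.
    """
    n = len(target)
    for seed in range(max_seed):
        if expand(n, seed) == target:
            return seed
    return None  # no seed found in the searched range

# seed = find_seed([9, 5, 8, 2, 1, 0, 6, 3, 4, 7])
# expand(10, seed) == [9, 5, 8, 2, 1, 0, 6, 3, 4, 7]
```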
Some additional leeway can be accepted if output 'R' has a hard limit of size reduction such that for very small array lengths, it is actually cheaper to store the randomized input directly. The above example is only 10 elements long, and as such is already pretty small. In practice this solution is needed to compress arrays with lengths well into the billions and beyond. So if your solution 'R' is only smaller for arrays with a length longer than 'X', it will still be a valid solution so long as 'X' is a reasonable number such as something in the hundreds or thousands and not in the millions and above.
As a final reminder, we are not concerned with the values contained in the array, only the indices. Only the order of the elements. For all intents and purposes we can assume that every value in the input array is an integer representing some other index in the array.
I recognize that this is a difficult problem, so all feedback and interest is appreciated. Thank you in advance for your time and contributions.
You say "arbitrary values", but then your examples are not at all arbitrary, but rather a permutation of the indices 0 to n-1. If you really mean "arbitrary", then there is no compression possible.
So I'm going to go with the permutation as what you meant. Then it seems you're really asking how to represent a permutation of n things in as few bits as possible.
All we need to do is count how many permutations there are in order to set a lower bound on how many bits it will take to represent. There are n! permutations, so the number of bits is lg(n!), where lg is the base 2 logarithm. As noted in a comment, we can use Stirling's approximation for factorial to get an idea of how well we can do. For large n, we get about n (lg(n) – 1.44) bits (good for n around 30 or more).
Now we need to figure out how many bits are needed to represent the array without compression. Let's assume that we spend a few constant bits to represent the number of bits per value. Then each value will take ceil(lg(n)) bits, where ceil is the ceiling function. We have n of them, so it will take n ceil(lg(n)) bits to represent directly. (I could leave off the last one, since I know that one must be the missing one. But that only removes one value out of the n, so we'll let that go, resulting in very slightly optimistic compression factors below.)
The ceil function increases the logarithm by less than one, and not at all if n is a power of two. We can approximate it as n lg(n), and in fact there are representations that combine values to approach that limit without the ceiling. So really just n lg(n). Alas, n lg(n) looks a lot like n (lg(n) – 1.44).
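(In case "combine values" is unclear, here is a hedged sketch of the simplest such representation: treat the n values as the digits of one big base-n integer, so the whole array costs ceil(n lg(n)) bits instead of n ceil(lg(n)) bits.)

```python
def pack(values, n):
    """Pack values (each in range(n)) into one integer: the digits of a base-n number."""
    big = 0
    for v in values:
        big = big * n + v
    return big

# For n = 10: one packed base-10 number holding 10 values needs
# ceil(10 * lg(10)) = 34 bits, versus 10 * ceil(lg(10)) = 40 bits
# when each value gets its own 4-bit field.
print(pack([9, 5, 8, 2, 1, 0, 6, 3, 4, 7], 10).bit_length())  # 34
```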
We didn't get very far with this compression. All we have going for us is that –1.44 n in the factorial estimate. We get compression factors of around 1.5 at n=10, continuing down to around 1.3 at n=100 and around 1.1 at n=10^5. Far from the factor of two that you're looking for.
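Those factors are easy to check numerically (a small sketch; math.lgamma gives the log of the factorial without overflow, even for huge n):

```python
import math

def compression_factor(n):
    """Ratio of the direct cost (~n lg(n) bits) to the best possible cost (lg(n!) bits)."""
    direct = n * math.log2(n)                   # n values, lg(n) bits each (ceiling ignored)
    optimal = math.lgamma(n + 1) / math.log(2)  # lg(n!) via the log-gamma function
    return direct / optimal

for n in (10, 100, 10**5, 10**9):
    print(n, round(compression_factor(n), 2))   # about 1.52, 1.27, 1.10, 1.05
```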
And, no, no random number generator is going to be able to beat the theoretical limit. Sorry.
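For completeness, one standard way to actually get near the lg(n!) bound is the Lehmer code (factorial number system): map the permutation to a single integer in 0 .. n!−1, which, padded out to ceil(lg(n!)) bits, is a fixed size for a given n. Here is a rough sketch only; as written it is O(n²) and relies on big-integer arithmetic, so it is not something you'd run on billion-element arrays, and it still only buys the modest 1.1–1.5 factors above, not 2:1.

```python
import math

def encode(perm):
    """Map a permutation of 0..n-1 to a single integer in range(n!)."""
    remaining = sorted(perm)                  # values not yet consumed, in ascending order
    index = 0
    for v in perm:
        pos = remaining.index(v)              # rank of v among the unused values
        index = index * len(remaining) + pos  # append one factorial-base digit
        remaining.pop(pos)
    return index

def decode(index, n):
    """Inverse of encode: rebuild the permutation from its integer index."""
    positions = []
    for radix in range(1, n + 1):             # peel the digits back off, least significant first
        index, pos = divmod(index, radix)
        positions.append(pos)
    positions.reverse()
    remaining = list(range(n))
    return [remaining.pop(pos) for pos in positions]

perm = [9, 5, 8, 2, 1, 0, 6, 3, 4, 7]
idx = encode(perm)
assert decode(idx, len(perm)) == perm
print(idx.bit_length(), math.ceil(math.log2(math.factorial(10))))  # 22 bits either way, vs. 40 raw
```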