I understand that the hash partitioning method in Oracle (and other databases) applies a hash algorithm to the partitioning key so that incoming data can be divided into roughly equal pieces, creating similarly sized partitions.
But how can such an algorithm work before seeing the actual data first? Wouldn't it be possible to end up with a set of hash partitions where 99% of the data goes into one partition and the remaining 1% is spread across the other n - 1 partitions?
Yes, it is theoretically possible. But hash algorithms essentially randomize the incoming data, and by the law of large numbers the partitions will end up close in size.
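As a rough sanity check (assuming you have an Oracle instance handy), you can push a run of distinct integers through ORA_HASH (the same function Oracle uses for hash partitioning, discussed below) and count how many land in each of 8 buckets. The counts come out nearly equal:

    -- Hash 1,000,000 distinct values into 8 buckets (0-7) and count
    -- the rows per bucket; each bucket ends up near 125,000 rows.
    SELECT ORA_HASH(n, 7) AS bucket, COUNT(*) AS cnt
    FROM   (SELECT level AS n FROM dual CONNECT BY level <= 1000000)
    GROUP  BY ORA_HASH(n, 7)
    ORDER  BY bucket;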
If you have a large amount of data, then even differences as small as 1% in the size of the partitions become quite unlikely -- assuming the original data has unique values.
However, if the original data is skewed, then the resulting bins may be skewed as well. For instance, suppose you have 100 values, where each of 0-9 occurs once and the value 10 occurs 90 times. All 90 rows with the value 10 hash to the same partition, so the resulting bins will be unbalanced.
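Running the same kind of bucket count over that skewed input makes the imbalance visible; all 90 copies of the value 10 land in a single bucket:

    -- Skewed input: each of 0-9 once, plus the value 10 repeated 90 times.
    -- One of the four buckets receives at least 90 of the 100 rows.
    SELECT ORA_HASH(val, 3) AS bucket, COUNT(*) AS cnt
    FROM  (SELECT level - 1 AS val FROM dual CONNECT BY level <= 10
           UNION ALL
           SELECT 10 FROM dual CONNECT BY level <= 90)
    GROUP BY ORA_HASH(val, 3)
    ORDER BY bucket;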
Oracle uses the function ORA_HASH to decide which hash partition a row goes into. The exact algorithm of that function is not publicly documented, but the manual does discuss a few of its properties: ORA_HASH works best with unique data and when the number of buckets is a power of 2. If those conditions aren't met, it's possible for some of the partitions to be significantly larger than the others.
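To tie this back to partitioning, here is a minimal sketch of a hash-partitioned table following the power-of-2 recommendation, along with a data-dictionary query to check how evenly the rows actually spread out. The table and column names are just illustrative:

    CREATE TABLE orders (
        order_id    NUMBER,
        customer_id NUMBER
    )
    PARTITION BY HASH (customer_id)
    PARTITIONS 8;

    -- After loading data, gather statistics and read the per-partition
    -- row counts from the data dictionary:
    EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'ORDERS')

    SELECT partition_name, num_rows
    FROM   user_tab_partitions
    WHERE  table_name = 'ORDERS'
    ORDER  BY partition_position;

With unique values in customer_id, the NUM_ROWS figures should come out close to equal; with heavily duplicated keys, you would see the same skew as in the bucket-count experiment above.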