For my current work on a grid generation algorithm I need an efficient way to transform three-dimensional coordinates to z-order (more precisely: three 4-Byte integers into one 8-Byte integer) and the other way round. This Wikipedia article describes it fairly well: Z-order curve. Since I am not a programmer, the solution I came up with does what it is supposed to do but might be quite naive using the mvbits intrinsic to do the bit interleaving explicitly:
SUBROUTINE pos_to_z(i, j, k, zval)
use types
INTEGER(I4B), INTENT(IN) :: i, j, k
INTEGER(I8B), INTENT(OUT) :: zval
INTEGER(I8B) :: i8, j8, k8
INTEGER(I4B) :: b
zval = 0
i8 = i-1
j8 = j-1
k8 = k-1
do b=0, 19
call mvbits(i8,b,1,zval,3*b+2)
call mvbits(j8,b,1,zval,3*b+1)
call mvbits(k8,b,1,zval,3*b )
end do
zval = zval+1
END SUBROUTINE pos_to_z
SUBROUTINE z_to_pos(zval, i, j, k)
use types
INTEGER(I8B), INTENT(IN) :: zval
INTEGER(I4B), INTENT(OUT) :: i, j, k
INTEGER(I8B) :: i8, j8, k8, z_order
INTEGER(I4B) :: b
z_order = zval-1
i8 = 0
j8 = 0
k8 = 0
do b=0, 19
call mvbits(z_order,3*b+2,1,i8,b)
call mvbits(z_order,3*b+1,1,j8,b)
call mvbits(z_order,3*b ,1,k8,b)
end do
i = int(i8,kind=I4B) + 1
j = int(j8,kind=I4B) + 1
k = int(k8,kind=I4B) + 1
END SUBROUTINE z_to_pos
Note that I prefer the input and output ranges to start with 1 instead of 0, leading to some additional calculations.
As it turns out, this implementation is rather slow. I measured the time it takes to transform and re-transform 10^7 positions:
gfortran -O0: 6.2340 seconds
gfortran -O3: 5.1564 seconds
ifort -O0: 4.2058 seconds
ifort -O3: 0.9793 seconds
I also tried different optimization options for gfortran without success. While the optimized code with ifort is already a lot faster, it is still the bottleneck of my program. It would be really helpful if someone could point me into the right direction how to do the bit interleaving more efficiently in Fortran.
Transforming from 3 co-ords to a z-order can be optimized using a look up table similar to the one described here. Since you only use 20 bits of your input values, it would be more efficient to use a look-up table with 1024 entries rather than 256, enough to index 10 bits, so that you only have to do 2 look-ups for each of your 3 input values, and modified for the case of interleaving 3 values instead of 2.
Entry n of the array stores the integer n, with its bits spread out so that bit 0 is in bit 0, bit 1 is moved to bit 3, bit 2 is moved to bit 6 and so on, with all the remaining bits set to zero. The lookup table array can be initialized like this:
subroutine init_morton_table(morton_table)
integer(kind=8), dimension (0:1023), intent (out) :: morton_table
integer :: b, v, z
do v=0, 1023
z = 0
do b=0, 9
call mvbits(v,b,1,z,3*b)
end do
morton_table(v) = z
end do
end subroutine init_morton_table
To actually interleave the values, separate your 3 input values into their low 10 bits and their high 10 bits, then use these 6 values as indices into the array and combine the looked-up values using shifts and adds to interleave the values together. Adds are equivalent to bitwise OR operations in this case because there won't be any carries given that at most one bit will be set in each bit position. Because only every 3rd bit can be set in the values in the tables, offsetting one of the values by 1 bit and another by 2 means there won't be any collisions.
subroutine pos_to_z(i, j, k, zval, morton_table)
integer, intent(in) :: i, j, k
integer(kind=8), dimension (0:1023), intent (in) :: morton_table
integer(kind=8), intent (out) :: zval
integer(kind=8) :: z, i8, j8, k8
i8 = i-1
j8 = j-1
k8 = k-1
z = morton_table(iand(k8, 1023))
z = z + ishft(morton_table(iand(j8, 1023)),1)
z = z + ishft(morton_table(iand(i8, 1023)),2)
z = z + ishft(morton_table(iand(ishft(k8,-10), 1023)),30)
z = z + ishft(morton_table(iand(ishft(j8,-10), 1023)),31)
zval = z + ishft(morton_table(iand(ishft(i8,-10), 1023)),32) + 1
end subroutine pos_to_z
You can use a similar technique to go the other way, but I don't think it will be quite as efficient. Create a lookup table of 32768 values (15 bits) that store 5 bits of the reconstructed input value. You will have to do 12 look-ups, getting 5 bits at a time for each of your three 20-bit values. Mask off the bottom 15 bits, then shift right 0, 1 and 2 bits to get your look-up indexes for k, j and i. Then shift and mask to get bits 15-29, 30-44 and 45-59 and do the same each time, shifting and adding to reconstruct k, j and i.