Search code examples
arraysswiftcryptographyuint32uint8array

Convert [UInt32] -> [UInt8] -> [[UInt8]] in Swift


I am trying to speed up my current implementation of a function that converts [UInt32] to a [UInt8] which in turn is split up into [[UInt8]] with 6 arrays at each index.

My implementation:

extension Array {
func splitBy(subSize: Int) -> [[Element]] {
    return 0.stride(to: self.count, by: subSize).map { startIndex in
        let endIndex = startIndex.advancedBy(subSize, limit: self.count)
        return Array(self[startIndex ..< endIndex])
    }
  }
}



func convertWordToBytes(fullW : [UInt32]) -> [[UInt8]] {
    var combined8 = [UInt8]()

    //Convert 17 [UInt32] to 68 [UInt8]
    for i in 0...16{
        _ = 24.stride(through: 0, by: -8).map {
            combined8.append(UInt8(truncatingBitPattern: fullW[i] >> UInt32($0)))
        }
    }

    //Split [UInt8] to [[UInt8]] with 6 values at each index.
    let combined48 = combined8.splitBy(6) 

    return combined48
}

This function will be iterated millions of times in my program and its speed is a huge burden.

Anyone got any ideas? Thanks


Solution

  • If you profile (Cmd + I) your code, you will see that most of the time is on various "copy to buffer" functions. This happens when you append a new element to the array but it has run out of its initial allocated space so it must be moved to a location on the heap with more memory. Morals of the lesson: heap allocation is slow but unavoidable with arrays. Do it as few times as possible.

    Try this:

    func convertWordToBytes2(fullW: [UInt32]) -> [[UInt8]] {
        let subSize = 6
    
        // We allocate the array only once per run since allocation is so slow
        // There will only be assignment to it after
        var combined48 = [UInt8](count: fullW.count * 4, repeatedValue: 0).splitBy(subSize)
    
        var row = 0
        var col = 0
    
        for i in 0...16 {
            for j in 24.stride(through: 0, by: -8) {
                let value = UInt8(truncatingBitPattern: fullW[i] >> UInt32(j))
                combined48[row][col] = value
    
                col += 1
                if col >= subSize {
                    row += 1
                    col = 0
                }
            }
        }
    
        return combined48
    }
    

    Benchmark code:

    let testCases = (0..<1_000_000).map { _ in
        (0..<17).map { _ in arc4random() }
    }
    
    testCases.forEach {
        convertWordToBytes($0)
        convertWordToBytes2($0)
    }
    

    Result (on my 2012 iMac)

    Weight          Self Weight         Symbol Name
    9.35 s   53.2%  412.00 ms           specialized convertWordToBytes([UInt32]) -> [[UInt8]]
    3.28 s   18.6%  344.00 ms           specialized convertWordToBytes2([UInt32]) -> [[UInt8]]
    

    By eliminating multiple allocations, we already reduced the run time by 60%. But each test case is independent, which lends itself perfectly to parallel processing with today's multi-core CPU. A modified loop...:

    dispatch_apply(testCases.count, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0)) { i in
        convertWordToBytes2(testCases[i])
    }
    

    ... will shave about 1 second off the wall time when executed on my quad-core i7 with 8 threads:

    Weight    Self Weight       Symbol Name
    2.28 s    6.4%  0 s         _dispatch_worker_thread3  0x58467
    2.24 s    6.3%  0 s         _dispatch_worker_thread3  0x58463
    2.22 s    6.2%  0 s         _dispatch_worker_thread3  0x58464
    2.21 s    6.2%  0 s         _dispatch_worker_thread3  0x58466
    2.21 s    6.2%  0 s         _dispatch_worker_thread3  0x58465
    2.21 s    6.2%  0 s         _dispatch_worker_thread3  0x58461
    2.18 s    6.1%  0 s         _dispatch_worker_thread3  0x58462
    

    The time saving is not as much as I hoped for. Apparently there's some contention when accessing the heap memory. For anything even faster, you should explore a C-based solution.