Tags: objective-c, swift, performance, nsdata

Swift Data initialization slow and memory inefficient with large dataset


If I try to initialize a Swift Data struct from a relatively large MutableRandomAccessSlice<Data>, the program's memory use grows large and it takes a long time to finish. Doing the same thing in Objective-C with NSData does not appear to have this problem.

For example, with the following code:

let startData = Data(count: 100_000_000)
let finalData = Data(startData[0..<95_234_877])

if I compile it using:

xcrun swiftc -O -sdk `xcrun --show-sdk-path --sdk macosx` -o output main.swift

execution (on my 2011 MacBook Air) takes a long time to finish (about 87 s) and memory usage climbs as high as 625 MB, as shown below:

$ time ./output
./output  85.21s user 1.29s system 99% cpu 1:26.91 total

$ top -o MEM
PID    COMMAND      %CPU  TIME     #TH   #WQ  #PORT MEM    PURG   CMPRS  PGRP  PPID  STATE
38156  output       99.0  01:25.57 1/1   0    10    625M+  0B     992M+  38156 36025 running

If I profile each step it takes about 0.00015s to create startData, 0.000007s to create the slice from startData, and the rest of the time to initialize finalData.
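
A minimal sketch of how such per-step timings might be collected. The helper name timed is hypothetical, the buffer is much smaller than the one above so the sketch finishes quickly, and this is not the measurement code that produced the numbers quoted here:

```swift
import Foundation

// Hypothetical helper: runs a closure and returns its result plus elapsed seconds.
func timed<T>(_ body: () -> T) -> (value: T, seconds: TimeInterval) {
    let start = Date()
    let value = body()
    return (value, Date().timeIntervalSince(start))
}

// Step 1: allocate a zero-filled buffer (smaller than in the question).
let created = timed { Data(count: 1_000_000) }

// Step 2: take a slice of it.
let sliced = timed { created.value[0..<952_348] }

// Step 3: build a new Data from the slice (the step reported as slow).
let rebuilt = timed { Data(sliced.value) }

print("create: \(created.seconds)s, slice: \(sliced.seconds)s, init: \(rebuilt.seconds)s")
```

The absolute numbers depend heavily on the machine and the Swift version, so only the relative cost of the three steps is meaningful.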

If I do the same thing in Objective-C:

NSData *startData = [[NSMutableData alloc] initWithLength:100000000];
NSData *finalData = [startData subdataWithRange:NSMakeRange(0, 95234877)];

it only takes roughly 0.00017s.

Am I doing something wrong in the Swift example? There seems to be a very large discrepancy between the two.


Solution

  • The Objective-C code [startData subdataWithRange:NSMakeRange(0, 95234877)] is equivalent to startData.subdata(in: 0..<95_234_877) in Swift, not to Data(startData[0..<95_234_877]).
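
    A minimal sketch of that Swift equivalent, using a smaller buffer than the question's so it runs quickly (the sizes here are illustrative assumptions, not the original figures):

    ```swift
    import Foundation

    // Zero-filled buffer, scaled down from the question's 100_000_000 bytes.
    let startData = Data(count: 1_000_000)

    // Counterpart of -[NSData subdataWithRange:]: copies the given byte range.
    let finalData = startData.subdata(in: 0..<952_348)

    print(finalData.count) // 952348
    ```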

    When you write Data(startData[0..<95_234_877]), Swift calls the generic initializer public convenience init<S : Sequence where S.Iterator.Element == Iterator.Element>(_ elements: S) of RangeReplaceableCollection, which is defined in RangeReplaceableCollection.swift.gyb. The core of its implementation looks like this:

    for element in newElements {
      append(element)
    }
    

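    Repeatedly appending single elements to a collection can be inefficient: each byte goes through a separate append call, and the storage may have to be reallocated as it grows.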
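
    To make the difference concrete, here is a small sketch that builds the same bytes both ways. It only illustrates the two approaches and is not the standard library's actual code; the sizes and the fill value 7 are arbitrary assumptions:

    ```swift
    import Foundation

    let source = Data(repeating: 7, count: 100_000)
    let slice = source[0..<95_000]

    // Element by element, in the spirit of the generic initializer's loop.
    var appended = Data()
    for byte in slice {
        appended.append(byte)
    }

    // Bulk copy of the same range.
    let copied = source.subdata(in: 0..<95_000)

    print(appended == copied) // true
    ```

    Both paths produce identical contents; the difference is only in how much work is done per byte.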

    Likewise, if you want to initialize a Data from a [UInt8], it is better to call the initializer specific to [UInt8]:

    let data = Data(bytes: [UInt8](repeating: 0, count: 10_000_000))
    

    By contrast, Data([UInt8](repeating: 0, count: 100_000_000)) calls the generic RangeReplaceableCollection initializer noted above.


    In my opinion, Swift should optimize such default implementations much more, but it is hard to make them as efficient as type-specific operations.