I'm trying to compress data to reduce its storage footprint, but I'm not sure whether I'm compressing the data incorrectly or measuring its size incorrectly.

I tried the following in a Playground:
```swift
import Foundation
import Compression

// Example data
struct MyData: Encodable {
    let property = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."
}

// I tried using MemoryLayout to measure the size of the uncompressed data
let size = MemoryLayout<MyData>.size
print("myData type size", size) // 16

let myData = MyData()
let myDataSize = MemoryLayout.size(ofValue: myData)
print("myData instance size", myDataSize) // 16

func run() {
    // 1. This shows the size of the encoded data
    guard let encoded = try? JSONEncoder().encode(myData) else { return }
    print("myData encoded size", encoded) // 589 bytes

    // 2. This shows the size after using a first compression method
    guard let compressed = try? (encoded as NSData).compressed(using: .lzfse) else { return }
    let firstCompression = compressed as Data
    print("firstCompression", firstCompression) // 491 bytes

    // 3. Second compression method (just wanted to try a different compression method)
    let secondCompression = compress(encoded)
    print("secondCompression", secondCompression) // 491 bytes

    // 4. Wanted to compare the size difference between compressed and uncompressed
    // for bigger data, so here is the array of uncompressed data.
    var myDataArray = [MyData]()
    for _ in 0 ... 100 {
        myDataArray.append(MyData())
    }
    guard let encodedArray = try? JSONEncoder().encode(myDataArray) else { return }
    print("myData encodedArray size", encodedArray) // 59591 bytes
    print("memory layout", MemoryLayout.size(ofValue: encodedArray)) // 16

    // 5. Compressed array
    var compressedArray = [Data]()
    for _ in 0 ... 100 {
        guard let compressed = try? (encoded as NSData).compressed(using: .lzfse) else { return }
        compressedArray.append(compressed as Data)
    }
    guard let encodedCompressedArray = try? JSONEncoder().encode(compressedArray) else { return }
    print("myData compressed array size", encodedCompressedArray) // 66661 bytes
    print("memory layout", MemoryLayout.size(ofValue: encodedCompressedArray)) // 16

    // 6. Compression using lzma
    var differentCompressionArray = [Data]()
    for _ in 0 ... 100 {
        guard let compressed = try? (encoded as NSData).compressed(using: .lzma) else { return }
        differentCompressionArray.append(compressed as Data)
    }
    guard let encodedCompressedArray2 = try? JSONEncoder().encode(differentCompressionArray) else { return }
    print("myData compressed array size", encodedCompressedArray2) // 60702 bytes
    print("memory layout", MemoryLayout.size(ofValue: encodedCompressedArray2)) // 16
}

run()

// The implementation for the second compression method
func compress(_ sourceData: Data) -> Data {
    let pageSize = 128
    var compressedData = Data()
    do {
        // Collect the compressed output as the filter produces it.
        let outputFilter = try OutputFilter(.compress, using: .lzfse) { (data: Data?) -> Void in
            if let data = data {
                compressedData.append(data)
            }
        }
        // Feed the source data to the filter in pageSize-byte chunks;
        // writing a zero-length chunk finalizes the stream.
        var index = 0
        let bufferSize = sourceData.count
        while true {
            let rangeLength = min(pageSize, bufferSize - index)
            let subdata = sourceData.subdata(in: index ..< index + rangeLength)
            index += rangeLength
            try outputFilter.write(subdata)
            if rangeLength == 0 {
                break
            }
        }
    } catch {
        fatalError("Error occurred during encoding: \(error.localizedDescription).")
    }
    return compressedData
}
```
The `MemoryLayout` object doesn't seem to be helpful in measuring the size of encoded arrays, whether or not they're compressed. I'm not sure how to measure a struct or an array of structs without encoding them with `JSONEncoder`, which already compresses the data.

The before/after compression for the single instance of `MyData` (#1, #2, and #3) seems to show that the data is being properly compressed, going from 589 bytes to 491 bytes. However, the comparison between an array of uncompressed data and an array of compressed data (#4, #5) seems to show that the size increased from 59591 to 66661 after compression.

Finally, I tried using a different compression algorithm, `lzma` (#6). It reduced the size to 60702, which is lower than the previous compression, but still wasn't smaller than the uncompressed data.
To get a bit of confusion out of the way first: `MemoryLayout` gives you information about the size and structure of the layout of a type at compile time, but it can't be used to determine the amount of storage an `Array` value needs at runtime, because the size of the `Array` structure itself does not depend on how much data it contains.

Highly simplified, the layout of an `Array` value looks like this:
```
┌─────────────────────┐
│        Array        │
├──────────┬──────────┤      ┌──────────────────┐
│  length  │  buffer ─┼─────▶│     storage      │
└──────────┴──────────┘      └──────────────────┘
  1 word /    1 word /
  8 bytes     8 bytes
└─────────┬──────────┘
          └─▶ MemoryLayout<Array<UInt8>>.size
```
An `Array` value stores its length, or `count` (mixed in with some flags, but we don't need to worry about that), and a pointer to the actual space where the items it contains are stored. Those items aren't stored as part of the `Array` value itself, but separately, in allocated memory which the `Array` points to. Whether the `Array` "contains" 10 values or 100000 values, the size of the `Array` structure remains the same: 1 word (or 8 bytes on a 64-bit system) for the length, and 1 word for the pointer to the actual underlying storage. (The size of the storage buffer, however, is exactly determined by the number of elements it is able to contain, at runtime.)

In practice, `Array` is significantly more complicated than this for bridging and other reasons, but this is the basic gist; this is why you only ever see `MemoryLayout.size(ofValue:)` return the same number every time. [And incidentally, the size of `String` is the same as `Array` for similar reasons, which is why `MemoryLayout<MyData>.size` also reports `16`.]
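You can see this fixed-size behavior directly. A minimal Playground sketch, assuming a 64-bit platform and the current Foundation implementation (using `Data`, whose header happens to be the same 16 bytes the question keeps printing):

```swift
import Foundation

let tiny = Data(repeating: 0, count: 10)
let huge = Data(repeating: 0, count: 100_000)

// MemoryLayout only sees the fixed-size struct header; the actual
// bytes live in storage that the struct references.
print(MemoryLayout.size(ofValue: tiny)) // 16
print(MemoryLayout.size(ofValue: huge)) // 16
```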
In order to know how many bytes an `Array` or a `Data` effectively takes up, it's sufficient to ask it for its `.count`: `Array<UInt8>` and `Data` are both collections of `UInt8` values (bytes), and their `.count` will reflect the amount of data effectively stored in their underlying storage.
As for the size increase between steps (4) and (5), note that the two aren't equivalent: step 4 takes the `MyData` instances and joins them together before converting them to JSON, while step 5 compresses `MyData` instances individually, joins those together, and then re-converts them to JSON. Step 5 has a few issues compared to step 4:

- Compression depends on finding repetition in the data. Within a single short string like `Hello`, there isn't a lot we can do, except maybe turn it into `Hel{2}o` (where `{2}` indicates a repetition of the last character 2 times)
- If we compress `Hello` and join it 3 times, we might get `Hel{2}oHel{2}oHel{2}o`; but if we instead join `Hello` 3 times and then compress, we could get `{Hel{2}o}{3}`, which is much more compact
- Compressed output also includes some bookkeeping metadata, so by compressing `MyData` 100 times and joining all of those instances, you're repeating that metadata 100 times
- On top of joining the compressed `MyData` instances, re-representing them as JSON decreases how compressed they are, because JSON can't represent the binary data exactly. Instead, it has to convert each `Data` blob into a Base64-encoded string, which causes it to grow again

Between these issues, it's not terribly surprising that your data is growing. What you actually want is a modification to step 4, which is compressing the joined data:
```swift
guard let encodedArray = try? JSONEncoder().encode(myDataArray) else { fatalError() }
guard let compressedEncodedArray = try? (encodedArray as NSData).compressed(using: .lzma) as Data else { fatalError() }
print(compressedEncodedArray.count) // => 520
```
This is significantly better than:
```swift
guard let encodedCompressedArray = try? JSONEncoder().encode(compressedArray) else { fatalError() }
print(encodedCompressedArray.count) // => 66661
```
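If you want to see the repetition, metadata, and Base64 effects in isolation, here's a small sketch (hypothetical data; exact counts vary by algorithm and input):

```swift
import Foundation

let unit = Data("Lorem ipsum dolor sit amet. ".utf8)
var joined = Data()
var blobs = [Data]()
for _ in 0 ..< 100 {
    joined.append(unit)
    // Each separately compressed blob pays its own metadata cost,
    // and no repetition *across* copies can be exploited.
    blobs.append(try (unit as NSData).compressed(using: .lzfse) as Data)
}

// Joining first lets the compressor see the repetition across all copies.
let joinedThenCompressed = try (joined as NSData).compressed(using: .lzfse) as Data

print(joined.count)                      // 2800 bytes uncompressed
print(joinedThenCompressed.count)        // a small fraction of that
print(blobs.reduce(0) { $0 + $1.count }) // a much larger total
// JSON-encoding the blobs then Base64-expands the binary by ~4/3 again:
print(try JSONEncoder().encode(blobs).count)
```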
As an aside: it seems unlikely that you're actually using `JSONEncoder` in practice to join data in this way, and that this was just for measurement here. But if you actually are, consider other mechanisms for doing this: converting binary data to JSON in this way is very inefficient storage-wise, and with a bit more information about what you might actually need in practice, we might be able to recommend a more effective way to do this.

If what you're actually doing in practice is encoding an `Encodable` object tree and then compressing that just once, that's totally fine.
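Purely as an illustration of such a mechanism (a hypothetical `join` helper, not something from the answer above): if you did need to store many binary blobs together without JSON's Base64 overhead, a simple length-prefixed layout avoids the expansion entirely:

```swift
import Foundation

// Hypothetical sketch: concatenate blobs, each prefixed with its byte
// count, so they can be split apart later without any Base64 expansion.
func join(_ blobs: [Data]) -> Data {
    var out = Data()
    for blob in blobs {
        withUnsafeBytes(of: UInt32(blob.count).littleEndian) {
            out.append(contentsOf: $0)
        }
        out.append(blob)
    }
    return out
}
```

Pairing this with a matching split function, and compressing the joined result once, keeps both the framing and the compression efficient.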