Using Swift, I need to read integers from binary files, but I can't read the whole files into memory because of their size. I have 61 GB (7.7 billion integers) of data written into a dozen files of various sizes. The largest is 18 GB (2.2 billion integers). Some of the files might be read completely into memory, but the largest is bigger than the available RAM.
Insert File IO Rant Here.
I have written the code that writes the files 10 million bytes at a time, and it works well. I wrote this as a class, but none of the rest of the code is object oriented. This is not an app, so there is no idle time to do memory cleanup. Here is the code:
import Foundation

class BufferedBinaryIO {
    var data = Data(capacity: 10000000)
    var data1: Data?
    let fileName: String!
    let fileurl: URL!
    var fileHandle: FileHandle? = nil
    var (forWriting, forReading) = (false, false)
    var tPointer: UnsafeMutablePointer<UInt8>?
    var pointer = 0
    init?(forWriting name: String) {
        forWriting = true
        fileName = name
        fileurl = URL(fileURLWithPath: fileName)
        if FileManager.default.fileExists(atPath: fileurl.path) {
            try! fileHandle = FileHandle(forWritingTo: fileurl)
            if fileHandle == nil {
                print("Can't open file to write.")
                return nil
            }
        } else {
            // if file does not exist write data for the first time
            do {
                try data.write(to: fileurl, options: .atomic)
                try fileHandle = FileHandle(forWritingTo: fileurl)
            } catch {
                print("Unable to write in new file.")
                return nil
            }
        }
    }
    init?(forReading name: String) {
        forReading = true
        fileName = name
        fileurl = URL(fileURLWithPath: fileName)
        if FileManager.default.fileExists(atPath: fileurl.path) {
            try! fileHandle = FileHandle(forReadingFrom: fileurl)
            if fileHandle == nil {
                print("Can't open file to read.")
                return nil
            }
        } else {
            // can't read a file that does not exist
            print("Unable to open file for reading.")
            return nil
        }
    }
    deinit {
        if forWriting {
            fileHandle?.seekToEndOfFile()
            fileHandle?.write(data)
        }
        try? fileHandle?.close()
    }

    func write(_ datum: Data) {
        guard forWriting else { return }
        self.data.append(datum)
        if data.count == 10000000 {
            fileHandle?.write(data)
            data.removeAll()
        }
    }
    func readInt() -> Int? {
        if data1 == nil || pointer == data1!.count {
            if #available(macOS 10.15.4, *) {
                //data1?.removeAll()
                //data1 = nil
                data1 = try! fileHandle?.read(upToCount: 10000000)
                pointer = 0
            } else {
                // Fallback on earlier versions
            }
        }
        if data1 != nil && pointer + 8 <= data1!.count {
            let retValue = data1!.withUnsafeBytes { $0.load(fromByteOffset: pointer, as: Int.self) }
            pointer += 8
            // data.removeFirst(8)
            return retValue
        } else {
            print("here")
        }
        return nil
    }
}
As I said, writing to the file works fine, and I can read from the file, but I have a problem.
Some of the solutions for reading binary data and converting it to various types use code like:
let rData = try! Data(contentsOf: url)
let tPointer = UnsafeMutablePointer<UInt8>.allocate(capacity: rData.count)
rData.copyBytes(to: tPointer, count: rData.count)
The first line reads the whole file, consuming a like amount of memory, and the next two lines double that memory consumption by copying the bytes. So even with 16 GB of RAM I can only read an 8 GB file, because the data is held in memory twice.
As you can see, my code does not use that approach. For the read I just read the file into data1, 10 million bytes at a time, then use data1 like a regular Data value and can read the data fine, without doubling the memory usage.
The code in the body of the program that uses this class looks like:
file loop .... {
    let string = String(format: "~path/filename.data")
    let dataPath = String(NSString(string: string).expandingTildeInPath)
    let fileBuffer = BufferedBinaryIO(forReading: dataPath)
    while let value = fileBuffer!.readInt() {
        loop code
    }
}
Here is my problem: this code works to read the file into Ints, but inside readInt the memory from the previous fileHandle?.read is not released when it does the next fileHandle?.read. So as I go through the file, memory consumption goes up by 10 million bytes each time the buffer is refilled, until the program crashes.
Forgive my code, as it is a work in progress; I keep changing it to try out different things to fix this problem. I made data1 an optional variable for the read portion of the code, thinking that setting it to nil would deallocate the memory, but it does the same thing when I just overwrite it.
That being said, this would be a nice way to code this if it worked.
So the question is: do I have a memory retention cycle, or is there a magic bean I need to use on data1 to get it to stop doing this?
Thank you in advance for your consideration of this problem.
You don't show your code that actually reads from your file, so it's a bit hard to be sure what's going on.
From the code you did show we can tell you're using a FileHandle, which allows random access to a file and reading arbitrary-sized blocks of data.
Assuming you're doing that part right and reading 10 million bytes at a time, your problem may be the way iOS and macOS handle memory. For some things, the OS puts no-longer-used memory blocks into an "autorelease pool", which gets freed when your code returns and the event loop gets serviced. If you're churning through multiple gigabytes of file data synchronously, it might not get a chance to release that memory before the next pass.
(Explaining macOS/iOS memory management in enough detail to cover autoreleasing would be pretty involved. If you're interested, I suggest you look up Apple's manual reference counting and automatic reference counting, a.k.a. ARC, and look for results that explain what goes on "under the covers".)
Try putting the code that reads each 10 million byte block of data inside the closure of an autoreleasepool { } statement. That will cause any autoreleased memory to actually get released. Something like the pseudo-code below:
while (more data) {
    autoreleasepool {
        // read a 10 million byte block of data
        // process that block
    }
}
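Applied to your class, a minimal sketch of what that could look like (assuming the rest of BufferedBinaryIO stays as you posted it) would wrap just the buffer refill in the pool, so anything autoreleased by the previous read gets drained once per 10 million byte block:

    func readInt() -> Int? {
        // Refill the buffer inside an autoreleasepool so any autoreleased Data
        // from fileHandle?.read(upToCount:) is released here instead of piling up.
        if data1 == nil || pointer == data1!.count {
            autoreleasepool {
                if #available(macOS 10.15.4, *) {
                    data1 = try! fileHandle?.read(upToCount: 10000000)
                    pointer = 0
                }
            }
        }
        if let buffer = data1, pointer + 8 <= buffer.count {
            let retValue = buffer.withUnsafeBytes { $0.load(fromByteOffset: pointer, as: Int.self) }
            pointer += 8
            return retValue
        }
        return nil
    }

You could instead wrap the body of your while let value = fileBuffer!.readInt() loop in autoreleasepool, but draining once per buffer refill, as above, keeps the per-integer overhead down. Either way, measure it; I'm assuming the growth you're seeing is autoreleased read buffers rather than something else retaining data1.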