android, zip, apache-commons

How to parse a ZIP file from an InputStream generated on demand, without holding a large byte array?


Background

I've been trying to figure out how to handle problematic ZIP files by reading them through a stream.

The reasons for it:

  1. Some ZIP files don't come from a file path. Some come from a Uri, and some are even inside another ZIP file.

  2. Some ZIP files are hard to open, so, combined with the previous point, what the framework offers is not enough. An example is "XAPK" files from the APKPure website (example here).

As one possible solution, I asked about allocating memory via JNI to hold the entire ZIP file in RAM, while using Apache's ZipFile class, which can read a ZIP file from various sources and not just from a file path.

The problem

This seems to work very well (here), but it has some problems:

  1. You don't always have enough memory available.
  2. You can't know for sure how much memory you are allowed to allocate without risking a crash.
  3. If you accidentally allocate too much memory, you can't catch it, and the app will crash. It's not like Java, where you can use try-catch to recover from OOM (if I'm wrong and you can, please let me know, because that would be very good to know about JNI).

So, assuming you can always create the InputStream (from a Uri or from within an existing ZIP file), how can you parse it as a ZIP file?

What I've found

I've made a working sample that does this by using Apache's ZipFile and letting it traverse the ZIP file as if it were all in memory.

Each time it asks to read some bytes from some position, I re-create the inputStream.

It works fine, but I can't find a good way to minimize the number of times I re-create the InputStream. I tried to at least cache the current InputStream and re-use it when possible (skipping forward from the current position if needed), re-creating it only when the required position is before the current one. Sadly, this failed in some cases (such as the XAPK file I mentioned above) with an EOFException.

So far I've only worked with a Uri, but a similar solution can be built for an InputStream that you re-create from within another ZIP file.
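For the nested case, re-creating the stream could look roughly like the sketch below. It assumes the outer archive is one that `java.util.zip.ZipInputStream` can read sequentially, which may not hold for every problematic file. `openNestedEntryStream` and its `openOuterStream` parameter are hypothetical names, not part of the sample project.

```kotlin
import java.io.InputStream
import java.util.zip.ZipInputStream

// Hypothetical helper: opens the outer archive again and advances to the wanted entry.
// `openOuterStream` is an assumed factory that returns a fresh stream of the outer ZIP.
// Returns null (and closes the outer stream) if no entry has the given name.
fun openNestedEntryStream(openOuterStream: () -> InputStream, entryName: String): InputStream? {
    val zipInputStream = ZipInputStream(openOuterStream())
    try {
        while (true) {
            val entry = zipInputStream.nextEntry ?: break
            if (entry.name == entryName) {
                // From here on, read() yields the entry's uncompressed bytes
                // and returns -1 at the end of that entry.
                return zipInputStream
            }
        }
    } catch (e: Exception) {
        zipInputStream.close()
        throw e
    }
    zipInputStream.close()
    return null
}
```

Closing the returned stream also closes the outer one. Each call walks the outer archive from its start, so this is as costly as re-opening a Uri stream and skipping forward.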

Here's the inefficient solution, which seems to always work well (a sample is available here, including both the inefficient solution and the one I tried to improve):

InefficientSeekableInputStreamByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
abstract class InefficientSeekableInputStreamByteChannel : SeekableByteChannel {
    private var position: Long = 0L
    private var cachedSize: Long = -1L
    private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)
    abstract fun getNewInputStream(): InputStream

    override fun isOpen(): Boolean = true

    override fun position(): Long = position

    override fun position(newPosition: Long): SeekableByteChannel {
//        Log.d("AppLog", "position $newPosition")
        require(newPosition >= 0L) { "Position has to be non-negative" }
        position = newPosition
        return this
    }

    open fun calculateSize(): Long {
        return getNewInputStream().use { inputStream: InputStream ->
            if (inputStream is FileInputStream)
                return inputStream.channel.size()
            var bytesCount = 0L
            while (true) {
                val available = inputStream.available()
                if (available == 0)
                    break
                val skip = inputStream.skip(available.toLong())
                if (skip < 0)
                    break
                bytesCount += skip
            }
            bytesCount
        }
    }

    final override fun size(): Long {
        if (cachedSize < 0L)
            cachedSize = calculateSize()
//        Log.d("AppLog", "size $cachedSize")
        return cachedSize
    }

    override fun close() {
    }

    override fun read(buf: ByteBuffer): Int {
        var wanted: Int = buf.remaining()
//        Log.d("AppLog", "read wanted:$wanted")
        if (wanted <= 0)
            return wanted
        //use the cached size, and compare as Long to avoid Int overflow on files larger than 2 GB
        val possible = size() - position
        if (possible <= 0L)
            return -1
        if (wanted > possible)
            wanted = possible.toInt()
        if (buffer.size < wanted)
            buffer = ByteArray(wanted)
        getNewInputStream().use { inputStream ->
            inputStream.skip(position)
            //now we have an inputStream right on the needed position
            inputStream.readBytesIntoByteArray(buffer, wanted)
        }
        buf.put(buffer, 0, wanted)
        position += wanted
        return wanted
    }

    //not needed, because we don't store anything in memory
    override fun truncate(size: Long): SeekableByteChannel = this

    override fun write(src: ByteBuffer?): Int {
        //not needed, we read only
        throw  NotImplementedError()
    }
}
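
One caveat about the `calculateSize()` fallback above: `InputStream.available()` only estimates how many bytes can be read without blocking, so a result of 0 does not necessarily mean the end of the stream. A minimal sketch of a more conservative fallback, which reads to the end and counts the bytes, might look like this (`countBytesByReading` is a hypothetical helper name, not part of the sample):

```kotlin
import java.io.InputStream

// Hypothetical helper: counts the bytes of a stream by reading it to the end,
// instead of relying on available(). The caller is responsible for closing the stream.
fun countBytesByReading(inputStream: InputStream): Long {
    val scratch = ByteArray(DEFAULT_BUFFER_SIZE)
    var total = 0L
    while (true) {
        val read = inputStream.read(scratch)
        if (read < 0) break // -1 is the end-of-stream signal
        total += read
    }
    return total
}
```

This reads the whole stream once, so it is slow for large files. It only makes sense as a last resort when the size can't be obtained in a cheaper way, and the result should be cached.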

InefficientSeekableInUriByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
class InefficientSeekableInUriByteChannel(someContext: Context, private val uri: Uri) : InefficientSeekableInputStreamByteChannel() {
    private val applicationContext = someContext.applicationContext

    override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)

    override fun getNewInputStream(): InputStream = BufferedInputStream(
            applicationContext.contentResolver.openInputStream(uri)!!)
}

Usage:

val file = ...
val uri = Uri.fromFile(file)
parseUsingInefficientSeekableInUriByteChannel(uri)
...
    private fun parseUsingInefficientSeekableInUriByteChannel(uri: Uri): Boolean {
        Log.d("AppLog", "testing using SeekableInUriByteChannel (re-creating inputStream when needed) ")
        try {
            val startTime = System.currentTimeMillis()
            ZipFile(InefficientSeekableInUriByteChannel(this, uri)).use { zipFile: ZipFile ->
                val entriesNamesAndSizes = ArrayList<Pair<String, Long>>()
                for (entry in zipFile.entries) {
                    val name = entry.name
                    val size = entry.size
                    entriesNamesAndSizes.add(Pair(name, size))
                    Log.v("AppLog", "entry name: $name - ${numberFormat.format(size)}")
                }
                val endTime = System.currentTimeMillis()
                Log.d("AppLog", "got ${entriesNamesAndSizes.size} entries data. time taken: ${endTime - startTime}ms")
                return true
            }
        } catch (e: Throwable) {
            Log.e("AppLog", "error while trying to parse using SeekableInUriByteChannel:$e")
            e.printStackTrace()
        }
        return false
    }

And here's my attempt at improving it, which didn't work in some cases:

SeekableInputStreamByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
abstract class SeekableInputStreamByteChannel : SeekableByteChannel {
    private var position: Long = 0L
    private var actualPosition: Long = 0L
    private var cachedSize: Long = -1L
    private var inputStream: InputStream? = null
    private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)
    abstract fun getNewInputStream(): InputStream

    override fun isOpen(): Boolean = true

    override fun position(): Long = position

    override fun position(newPosition: Long): SeekableByteChannel {
//        Log.d("AppLog", "position $newPosition")
        require(newPosition >= 0L) { "Position has to be non-negative" }
        position = newPosition
        return this
    }

    open fun calculateSize(): Long {
        return getNewInputStream().use { inputStream: InputStream ->
            if (inputStream is FileInputStream)
                return inputStream.channel.size()
            var bytesCount = 0L
            while (true) {
                val available = inputStream.available()
                if (available == 0)
                    break
                val skip = inputStream.skip(available.toLong())
                if (skip < 0)
                    break
                bytesCount += skip
            }
            bytesCount
        }
    }

    final override fun size(): Long {
        if (cachedSize < 0L)
            cachedSize = calculateSize()
//        Log.d("AppLog", "size $cachedSize")
        return cachedSize
    }

    override fun close() {
        inputStream.closeSilently().also { inputStream = null }
    }

    override fun read(buf: ByteBuffer): Int {
        var wanted: Int = buf.remaining()
//        Log.d("AppLog", "read wanted:$wanted")
        if (wanted <= 0)
            return wanted
        //use the cached size, and compare as Long to avoid Int overflow on files larger than 2 GB
        val possible = size() - position
        if (possible <= 0L)
            return -1
        if (wanted > possible)
            wanted = possible.toInt()
        if (buffer.size < wanted)
            buffer = ByteArray(wanted)
        var inputStream = this.inputStream
        //skipping to required position
        if (inputStream == null) {
            inputStream = getNewInputStream()
//            Log.d("AppLog", "getNewInputStream")
            inputStream.skip(position)
            this.inputStream = inputStream
        } else {
            if (actualPosition > position) {
                inputStream.close()
                actualPosition = 0L
                inputStream = getNewInputStream()
//                Log.d("AppLog", "getNewInputStream")
                this.inputStream = inputStream
            }
            inputStream.skip(position - actualPosition)
        }
        //now we have an inputStream right on the needed position
        inputStream.readBytesIntoByteArray(buffer, wanted)
        buf.put(buffer, 0, wanted)
        position += wanted
        actualPosition = position
        return wanted
    }

    //not needed, because we don't store anything in memory
    override fun truncate(size: Long): SeekableByteChannel = this

    override fun write(src: ByteBuffer?): Int {
        //not needed, we read only
        throw  NotImplementedError()
    }
}

SeekableInUriByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
class SeekableInUriByteChannel(someContext: Context, private val uri: Uri) : SeekableInputStreamByteChannel() {
    private val applicationContext = someContext.applicationContext

    override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)

    override fun getNewInputStream(): InputStream = BufferedInputStream(
            applicationContext.contentResolver.openInputStream(uri)!!)
}

The questions

Is there a way to improve it?

Maybe by having as little re-creation of InputStream as possible?

Are there more possible optimizations that will let it parse ZIP files well? Maybe some caching of the data?

I ask this because it seems it's quite slow compared to other solutions I've found, and I think this could help a bit.


Solution

  • The skip() method of BufferedInputStream may skip fewer bytes than you request (it returns the number of bytes actually skipped). In SeekableInputStreamByteChannel, change the following code

    inputStream.skip(position - actualPosition)
    

    to

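    var bytesToSkip = position - actualPosition
    while (bytesToSkip > 0) {
        val skipped = inputStream.skip(bytesToSkip)
        if (skipped > 0) bytesToSkip -= skipped
        else if (inputStream.read() >= 0) bytesToSkip-- // no progress: consume one byte
        else break // end of stream, avoid looping forever
    }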
    

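    and that should work. The inputStream.skip(position) call made right after a new stream is created has the same limitation.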

    Regarding efficiency, the first thing ZipFile does is seek to the end of the file to read the central directory (CD). With the CD in hand, ZipFile knows the layout of the ZIP file. The ZIP entries should be in the same order as the files are laid out. I would read the files you want in that same order to avoid backtracking. If you can't guarantee the read order, then maybe multiple input streams would make sense.
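
The effect of read order can be illustrated with a small model. It assumes a channel that re-creates its stream whenever a requested position lies before the current stream position, as in the question. `EntryLocation`, `countStreamCreations` and `inArchiveOrder` are hypothetical names used only for this sketch.

```kotlin
// Hypothetical model, not part of Apache Commons Compress or the sample project.
// An entry's data occupies [offset, offset + length) in the archive.
data class EntryLocation(val name: String, val offset: Long, val length: Long)

// Counts how many times a forward-only stream has to be (re)created when the
// entries are read in the given order: a new stream is needed for the first
// read and whenever the next entry starts before the current stream position.
fun countStreamCreations(readOrder: List<EntryLocation>): Int {
    var creations = 0
    var streamPosition = -1L // -1 means no stream has been opened yet
    for (entry in readOrder) {
        if (streamPosition < 0 || entry.offset < streamPosition) creations++
        streamPosition = entry.offset + entry.length
    }
    return creations
}

// The same entries, arranged by their position in the archive.
fun inArchiveOrder(entries: List<EntryLocation>): List<EntryLocation> =
    entries.sortedBy { it.offset }
```

For three made-up entries at offsets 0, 100 and 400, reading them in the order 400, 0, 100 needs two streams in this model, while reading them sorted by offset needs one. With Apache Commons Compress, `ZipFile.getEntriesInPhysicalOrder()` returns the entries ordered by their start offset, so it is a natural way to obtain that order.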