I've been trying to figure out how to deal with problematic ZIP files using a stream out of them.
The reasons for it:
Some ZIP files are not originated from a file-path. Some are from some Uri, some are even within another ZIP file.
Some ZIP files are quite problematic in opening, so together with the previous point, it is impossible to use just what the framework has to offer. Example of such files as "XAPK" files from APKPure website (example here).
As one of the possible solutions I've searched for, I asked about memory allocation via JNI, to hold the entire ZIP file inside the RAM, while using Apache's ZipFile class which can handle a zip file in various ways and not just via file-path.
Such a thing seems to work (here) very well, but it has some problems:
So, let's assume you can always create the InputStream (via Uri or from within an existing zip file), how could you parse it as a zip file?
I've made a working sample that can do it, by using Apache's ZipFile, and letting it traverse the Zip file as if it has all in memory.
Each time it asks to read some bytes from some position, I re-create the inputStream.
It works fine, but the problem with this is that I can't optimize it well to minimize the amount of times I re-create the inputStream. I tried to at least cache the current InputStream, and that if it's good enough, re-use it (skip from current position if needed), and if the required position is before current one, re-create the inputStream. Sadly it failed in some cases (such as the XAPK file I've mentioned above), as it causes EOFException.
Currently I've worked only with Uri, but a similar solution can be done for InputStream you re-generate from within another zip-file.
Here's the inefficient solution (sample available here, including both inefficient solution and the one I tried to make better), which seems to always work well:
InefficientSeekableInputStreamByteChannel.kt
@RequiresApi(Build.VERSION_CODES.N)
abstract class InefficientSeekableInputStreamByteChannel : SeekableByteChannel {
private var position: Long = 0L
private var cachedSize: Long = -1L
private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)
abstract fun getNewInputStream(): InputStream
override fun isOpen(): Boolean = true
override fun position(): Long = position
override fun position(newPosition: Long): SeekableByteChannel {
// Log.d("AppLog", "position $newPosition")
require(newPosition >= 0L) { "Position has to be positive" }
position = newPosition
return this
}
open fun calculateSize(): Long {
return getNewInputStream().use { inputStream: InputStream ->
if (inputStream is FileInputStream)
return inputStream.channel.size()
var bytesCount = 0L
while (true) {
val available = inputStream.available()
if (available == 0)
break
val skip = inputStream.skip(available.toLong())
if (skip < 0)
break
bytesCount += skip
}
bytesCount
}
}
final override fun size(): Long {
if (cachedSize < 0L)
cachedSize = calculateSize()
// Log.d("AppLog", "size $cachedSize")
return cachedSize
}
override fun close() {
}
override fun read(buf: ByteBuffer): Int {
var wanted: Int = buf.remaining()
// Log.d("AppLog", "read wanted:$wanted")
if (wanted <= 0)
return wanted
val possible = (calculateSize() - position).toInt()
if (possible <= 0)
return -1
if (wanted > possible)
wanted = possible
if (buffer.size < wanted)
buffer = ByteArray(wanted)
getNewInputStream().use { inputStream ->
inputStream.skip(position)
//now we have an inputStream right on the needed position
inputStream.readBytesIntoByteArray(buffer, wanted)
}
buf.put(buffer, 0, wanted)
position += wanted
return wanted
}
//not needed, because we don't store anything in memory
override fun truncate(size: Long): SeekableByteChannel = this
override fun write(src: ByteBuffer?): Int {
//not needed, we read only
throw NotImplementedError()
}
}
InefficientSeekableInUriByteChannel.kt
@RequiresApi(Build.VERSION_CODES.N)
class InefficientSeekableInUriByteChannel(someContext: Context, private val uri: Uri) : InefficientSeekableInputStreamByteChannel() {
private val applicationContext = someContext.applicationContext
override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)
override fun getNewInputStream(): InputStream = BufferedInputStream(
applicationContext.contentResolver.openInputStream(uri)!!)
}
Usage:
val file = ...
val uri = Uri.fromFile(file)
parseUsingInefficientSeekableInUriByteChannel(uri)
...
private fun parseUsingInefficientSeekableInUriByteChannel(uri: Uri): Boolean {
Log.d("AppLog", "testing using SeekableInUriByteChannel (re-creating inputStream when needed) ")
try {
val startTime = System.currentTimeMillis()
ZipFile(InefficientSeekableInUriByteChannel(this, uri)).use { zipFile: ZipFile ->
val entriesNamesAndSizes = ArrayList<Pair<String, Long>>()
for (entry in zipFile.entries) {
val name = entry.name
val size = entry.size
entriesNamesAndSizes.add(Pair(name, size))
Log.v("Applog", "entry name: $name - ${numberFormat.format(size)}")
}
val endTime = System.currentTimeMillis()
Log.d("AppLog", "got ${entriesNamesAndSizes.size} entries data. time taken: ${endTime - startTime}ms")
return true
}
} catch (e: Throwable) {
Log.e("AppLog", "error while trying to parse using SeekableInUriByteChannel:$e")
e.printStackTrace()
}
return false
}
And here's my attempt in improving it, which didn't work in some cases:
SeekableInputStreamByteChannel.kt
@RequiresApi(Build.VERSION_CODES.N)
abstract class SeekableInputStreamByteChannel : SeekableByteChannel {
private var position: Long = 0L
private var actualPosition: Long = 0L
private var cachedSize: Long = -1L
private var inputStream: InputStream? = null
private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)
abstract fun getNewInputStream(): InputStream
override fun isOpen(): Boolean = true
override fun position(): Long = position
override fun position(newPosition: Long): SeekableByteChannel {
// Log.d("AppLog", "position $newPosition")
require(newPosition >= 0L) { "Position has to be positive" }
position = newPosition
return this
}
open fun calculateSize(): Long {
return getNewInputStream().use { inputStream: InputStream ->
if (inputStream is FileInputStream)
return inputStream.channel.size()
var bytesCount = 0L
while (true) {
val available = inputStream.available()
if (available == 0)
break
val skip = inputStream.skip(available.toLong())
if (skip < 0)
break
bytesCount += skip
}
bytesCount
}
}
final override fun size(): Long {
if (cachedSize < 0L)
cachedSize = calculateSize()
// Log.d("AppLog", "size $cachedSize")
return cachedSize
}
override fun close() {
inputStream.closeSilently().also { inputStream = null }
}
override fun read(buf: ByteBuffer): Int {
var wanted: Int = buf.remaining()
// Log.d("AppLog", "read wanted:$wanted")
if (wanted <= 0)
return wanted
val possible = (calculateSize() - position).toInt()
if (possible <= 0)
return -1
if (wanted > possible)
wanted = possible
if (buffer.size < wanted)
buffer = ByteArray(wanted)
var inputStream = this.inputStream
//skipping to required position
if (inputStream == null) {
inputStream = getNewInputStream()
// Log.d("AppLog", "getNewInputStream")
inputStream.skip(position)
this.inputStream = inputStream
} else {
if (actualPosition > position) {
inputStream.close()
actualPosition = 0L
inputStream = getNewInputStream()
// Log.d("AppLog", "getNewInputStream")
this.inputStream = inputStream
}
inputStream.skip(position - actualPosition)
}
//now we have an inputStream right on the needed position
inputStream.readBytesIntoByteArray(buffer, wanted)
buf.put(buffer, 0, wanted)
position += wanted
actualPosition = position
return wanted
}
//not needed, because we don't store anything in memory
override fun truncate(size: Long): SeekableByteChannel = this
override fun write(src: ByteBuffer?): Int {
//not needed, we read only
throw NotImplementedError()
}
}
SeekableInUriByteChannel.kt
@RequiresApi(Build.VERSION_CODES.N)
class SeekableInUriByteChannel(someContext: Context, private val uri: Uri) : SeekableInputStreamByteChannel() {
private val applicationContext = someContext.applicationContext
override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)
override fun getNewInputStream(): InputStream = BufferedInputStream(
applicationContext.contentResolver.openInputStream(uri)!!)
}
Is there a way to improve it?
Maybe by having as little re-creation of InputStream as possible?
Are there more possible optimizations that will let it parse ZIP files well? Maybe some caching of the data?
I ask this because it seems it's quite slow compared to other solutions I've found, and I think this could help a bit.
The skip()
method of BufferedInputStream doesn't always skip all the bytes that you specify. In SeekableInputStreamByteChannel change the following code
inputStream.skip(position - actualPosition)
to
var bytesToSkip = position - actualPosition
while (bytesToSkip > 0) {
bytesToSkip -= inputStream.skip(bytesToSkip)
}
and that should work.
Regarding making things more efficient, the first thing that ZipFile does is to zoom to the end of the file to get the central directory (CD). With the CD in hand, ZipFile knows the make up of the zip file. The zip entries should be in the same order as the files are laid out. I would read the files you want in that same order to avoid backtracking. If you can't guarantee the read order then maybe multiple input streams would make sense.