I have an InputStream of data that is the content of a file, but does not have any file information attached. I would like to be able to distinguish between cases when the data represents a *.zip file, and cases where it is a container file format (e.g. *.docx, *.odt, *.jar) that uses zip under the covers. I don't necessarily need to know what the container format is, just whether a stream is a "plain" zip or not (so I know whether it's appropriate to split the stream into separate files or not).
Is this possible? I'm happy to do the detection either after decompressing or before.
Ideally I'm trying to do this in Java, but if there are code examples in other languages then I'm happy to port them across if necessary.
There's no absolutely reliable and correct way to do this, because those formats that use the ZIP format as a container tend to be 100% valid and correct ZIP files.
So they are ZIP files.
However, the number of such formats is finite, and only a small subset of them is commonly found in the real world, so you can probably get away with specifically detecting those formats and treating everything you don't recognize as a "real" ZIP file.
Most of these formats require some kind of easy-to-check identifier in the early bytes of the file, so if you are okay with writing specification-specific code it should be easy enough.
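To make the "detect the known containers, treat the rest as plain ZIP" idea concrete, here is a minimal Java sketch. The class name ZipKindSniffer and the list of marker entries are my own assumptions rather than an exhaustive or authoritative list: ODF and EPUB packages are expected to start with an entry named mimetype, OOXML packages contain [Content_Types].xml, and JARs usually (but not always) carry META-INF/MANIFEST.MF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of any library.
final class ZipKindSniffer {
    enum Kind { NOT_ZIP, PLAIN_ZIP, CONTAINER }

    // Reads entry names from the stream and looks for well-known marker entries.
    // The stream is consumed and closed by this call.
    static Kind classify(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            boolean sawEntry = false;
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Assumption: ODF/EPUB style packages put "mimetype" first.
                if (!sawEntry && name.equals("mimetype")) {
                    return Kind.CONTAINER;
                }
                // Assumption: these names mark OOXML packages and JARs.
                if (name.equals("[Content_Types].xml")
                        || name.equals("META-INF/MANIFEST.MF")) {
                    return Kind.CONTAINER;
                }
                sawEntry = true;
            }
            // No entries at all: not a ZIP, or an empty archive.
            return sawEntry ? Kind.PLAIN_ZIP : Kind.NOT_ZIP;
        }
    }
}
```

Because the call consumes the stream, you would need to buffer the data (or wrap it in something you can re-read) if you want to split it into files afterwards. A plain archive that happens to contain one of the marker names will be misclassified, which is the unavoidable trade-off of this heuristic.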
The file utility detects most of those formats correctly, so looking into its source should give you enough pointers.
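For a check that works before decompressing, the early bytes can be inspected directly. The sketch below is my own illustration, not a port of any existing tool: it relies on the ZIP local file header layout (signature PK\3\4 at offset 0, file name length at offset 26, file name starting at offset 30) and on the assumption that ODF and EPUB style packages store an entry named mimetype first. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of any library.
final class EarlyBytesSniffer {
    private static final byte[] NAME = "mimetype".getBytes(StandardCharsets.US_ASCII);

    // Reads up to n bytes and rewinds, so the caller can still use the stream.
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);
        byte[] buf = new byte[n];
        int filled = 0;
        while (filled < n) {
            int r = in.read(buf, filled, n - filled);
            if (r < 0) break; // end of stream before n bytes
            filled += r;
        }
        in.reset();
        return Arrays.copyOf(buf, filled);
    }

    // True if the data begins with a ZIP local file header signature.
    static boolean startsWithLocalHeader(byte[] head) {
        return head.length >= 4
                && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4;
    }

    // True if the first entry's name is exactly "mimetype".
    static boolean firstEntryIsMimetype(byte[] head) {
        if (!startsWithLocalHeader(head) || head.length < 30 + NAME.length) {
            return false;
        }
        // File name length is a little-endian 16-bit value at offset 26.
        if (head[26] != NAME.length || head[27] != 0) {
            return false;
        }
        for (int i = 0; i < NAME.length; i++) {
            if (head[30 + i] != NAME[i]) return false;
        }
        return true;
    }
}
```

Wrapping the original stream in a BufferedInputStream and calling peek(in, 38) is enough for this particular check. Formats that do not pin their identifier to a fixed offset still need the entry names to be examined.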
Some examples:
It's also quite likely (haven't checked) that Apache Tika already does all that detection.