Handling compressed XML documents with dom4j

I'm using dom4j to read KML documents and parse out some of the data in the XML. When I pass the URL to the reader as a string, it's simple and handles both file system URLs and web URLs:

SAXReader reader = new SAXReader();
Document document = reader.read(url);

The problem is that my code sometimes needs to handle KMZ documents, which are basically just zipped-up XML (KML) documents. Unfortunately, SAXReader offers no convenient way to handle this. I've found all kinds of funky solutions for determining whether a given file is a ZIP file, but my code quickly becomes bloated and nasty -- reading the stream, building a file, checking the "magic" hex bytes at the beginning, extracting, etc.

Is there some quick and clean way to handle this? An easier way to connect to any URL and extract the contents if they're compressed, otherwise simply grab the XML?

Solution

Hmm, it doesn't seem that KMZDOMLoader handles KMZ files on the web. The KMZ may be generated dynamically, so it won't always have a) a file reference or b) a .kmz extension specifically -- the code will have to determine the format by content type.

What I ended up doing was building a URL object and checking its protocol. I have separate logic for a local file and for a document on the web, and inside each of those branches I determine whether the document is compressed. SAXReader's read() method accepts an InputStream, so for KMZ files I could wrap the raw stream in a ZipInputStream and position it at the first entry.

Here's the code I ended up with:

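import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.net.HttpURLConnection;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.zip.ZipInputStream;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.io.SAXReader;

// the members below live inside the KmlMetadataExtractorAdaptor class

// ZIP local file header signature: the bytes 'P', 'K', 0x03, 0x04 read as a big-endian int
private static final long ZIP_MAGIC_NUMBERS = 0x504B0304;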
private static final String KMZ_CONTENT_TYPE = "application/vnd.google-earth.kmz";

private Document getDocument(String urlString) throws IOException, DocumentException, URISyntaxException {
        InputStream inputStream = null;
        URL url = new URL(urlString);
        String protocol = url.getProtocol();

        /*
         * Figure out how to get the XML from the URL -- there are 4 possibilities:
         * 
         * 1)  a KML (uncompressed) doc on the filesystem
         * 2)  a KMZ (compressed) doc on the filesystem
         * 3)  a KML (uncompressed) doc on the web
         * 4)  a KMZ (compressed) doc on the web
         */
        if (protocol.equalsIgnoreCase("file")) {
            // the provided input URL points to a file on a file system
            File file = new File(url.toURI());
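            long n;
            try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                // read the first four bytes; a file shorter than that cannot be a ZIP archive
                n = raf.length() >= 4 ? raf.readInt() : -1;
            }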

            if (n == KmlMetadataExtractorAdaptor.ZIP_MAGIC_NUMBERS) {
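                // the file is a KMZ file; getNextEntry() positions the stream at the
                // first entry, which is assumed here to be the KML document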
                inputStream = new ZipInputStream(new FileInputStream(file));
                ((ZipInputStream) inputStream).getNextEntry();
            } else {
                // the file is a KML file
                inputStream = new FileInputStream(file);
            }

        } else if (protocol.equalsIgnoreCase("http") || protocol.equalsIgnoreCase("https")) {
            // the provided input URL points to a web location
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.connect();

            String contentType = connection.getContentType();

            // getContentType() returns null when the content type is not known
            if (contentType != null && contentType.contains(KmlMetadataExtractorAdaptor.KMZ_CONTENT_TYPE)) {
                // the target resource is KMZ; as above, the first entry is assumed to be the KML
                inputStream = new ZipInputStream(connection.getInputStream());
                ((ZipInputStream) inputStream).getNextEntry();
            } else {
                // the target resource is KML
                inputStream = connection.getInputStream();
            }

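        } else {
            // fail clearly instead of handing a null stream to the reader
            throw new IOException("Unsupported protocol: " + protocol);
        }

        // close the stream even if parsing fails
        try {
            return new SAXReader().read(inputStream);
        } finally {
            inputStream.close();
        }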
    }
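
A possible simplification, offered only as a sketch: rather than branching on protocol and content type, the ZIP signature can be sniffed directly from whatever stream the URL yields, which covers all four cases with one code path. The class name KmlStreams and the method openKml are hypothetical names introduced for illustration. The sketch uses only JDK classes and assumes that the KML document inside a KMZ is the first entry whose name ends in ".kml", so it also copes with archives where an image or other resource comes first.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of dom4j or of the code above.
final class KmlStreams {

    private KmlStreams() {
    }

    /**
     * Returns a stream positioned at the KML content, whether the source is
     * plain KML or a ZIP (KMZ) archive. The caller closes the returned stream.
     */
    static InputStream openKml(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);

        // Peek at the first four bytes, then rewind so nothing is consumed.
        in.mark(4);
        byte[] head = new byte[4];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break; // stream ended before four bytes were available
            }
            read += n;
        }
        in.reset();

        // ZIP local file header signature: 'P', 'K', 0x03, 0x04
        boolean isZip = read == 4
                && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4;
        if (!isZip) {
            return in; // treat anything else as uncompressed KML
        }

        // Assumption: the document we want is the first *.kml entry.
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String name = entry.getName().toLowerCase(Locale.ROOT);
            if (!entry.isDirectory() && name.endsWith(".kml")) {
                return zip; // reads from here stop at the end of this entry
            }
        }
        zip.close();
        throw new IOException("No .kml entry found in archive");
    }
}
```

With dom4j this could be used as new SAXReader().read(KmlStreams.openKml(url.openStream())), closing the returned stream afterwards. URL.openStream() works for both file: and http(s): URLs, so the protocol check and the reliance on the server sending the right Content-Type would no longer be needed.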