java google-cloud-platform google-cloud-storage tar archive

Corrupted TAR File Error Upon Access From Google Cloud Storage in Java

I am storing a TAR file in Google Cloud Storage. The file can be downloaded successfully via gsutil and extracted on my computer using the macOS Archive Utility. However, my Java program always fails with java.io.IOException: Corrupted TAR archive when it accesses the file. I have tried several approaches, all of them using the org.apache.commons:commons-compress library. Can you give me some insight into how to fix this problem, or suggest something else I can try?

Here are the implementations that I have tried:

Blob blob = storage.get(BUCKET_NAME, FILE_PATH);
blob.downloadTo(Paths.get("filename.tar"));
String contentType = blob.getContentType(); // application/x-tar

InputStream is = Channels.newInputStream(blob.reader());
String mime = URLConnection.guessContentTypeFromStream(is); // null
TarArchiveInputStream ais = new TarArchiveInputStream(is);
ais.getNextEntry(); // throws java.io.IOException: Corrupted TAR archive

InputStream is2 = new ByteArrayInputStream(blob.getContent());
String mime2 = URLConnection.guessContentTypeFromStream(is2); // null
TarArchiveInputStream ais2 = new TarArchiveInputStream(is2);
ais2.getNextEntry(); // throws java.io.IOException: Corrupted TAR archive

InputStream is3 = new FileInputStream("filename.tar");
String mime3 = URLConnection.guessContentTypeFromStream(is3); // null
TarArchiveInputStream ais3 = new TarArchiveInputStream(is3);
ais3.getNextEntry(); // throws java.io.IOException: Corrupted TAR archive

TarFile file = new TarFile(blob.getContent()); // throws java.io.IOException: Corrupted TAR archive
TarFile tarFile = new TarFile(Paths.get("filename.tar")); // throws java.io.IOException: Corrupted TAR archive

Addition: I also tried parsing a JSON file from GCS, and that works fine.

Blob blob = storage.get(BUCKET_NAME, FILE_PATH);
JSONTokener jt = new JSONTokener(Channels.newInputStream(blob.reader()));
JSONObject jo = new JSONObject(jt);

Solution

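The problem is that your TAR file is gzip-compressed: despite the .tar name, it is a tgz (tar.gz) file. TarArchiveInputStream and TarFile expect raw TAR headers, so they report the gzip bytes as a corrupted archive.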

For that reason, you need to decompress the data first and only then read it as a TAR archive.
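If you are not sure whether a given file is compressed, one way to check is to look at its first two bytes: a gzip stream starts with 0x1f 0x8b. The following is only a minimal sketch of that idea, not part of the original code. The class name GzipSniffer and its methods are made up for illustration, and it uses the JDK's java.util.zip.GZIPInputStream instead of Commons Compress so that it runs without extra dependencies (it assumes Java 9+ for readAllBytes).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only.
public class GzipSniffer {

    /** Returns true if the stream starts with the gzip magic bytes 0x1f 0x8b. */
    static boolean looksGzipped(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(2);          // remember the current position
        int b0 = in.read();
        int b1 = in.read();
        in.reset();          // rewind so the caller still sees the full stream
        return b0 == 0x1f && b1 == 0x8b;
    }

    /** Wraps the stream in a gzip decoder only when the magic bytes are present. */
    static InputStream maybeDecompress(InputStream raw) throws IOException {
        InputStream buffered = new BufferedInputStream(raw);
        return looksGzipped(buffered) ? new GZIPInputStream(buffered) : buffered;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "pretend this is TAR data".getBytes(StandardCharsets.UTF_8);

        // Build a gzip-compressed copy of the payload in memory.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(payload);
        }

        // Both inputs come back as the same plain bytes.
        try (InputStream a = maybeDecompress(new ByteArrayInputStream(compressed.toByteArray()));
             InputStream b = maybeDecompress(new ByteArrayInputStream(payload))) {
            System.out.println(new String(a.readAllBytes(), StandardCharsets.UTF_8));
            System.out.println(new String(b.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

In your case the stream returned by maybeDecompress would then be handed to TarArchiveInputStream. The same check should work on the stream obtained from blob.reader(), since BufferedInputStream adds the mark/reset support that the check relies on.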

Consider the following example, which uses the GzipCompressorInputStream class that ships with Commons Compress:

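import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;

public class ExtractTgz {
  public static void main(String... args) {
    final Path targetDir = Paths.get("/tmp");
    try (
        InputStream in = new FileInputStream("latest.tar");
        // Undo the gzip layer first; the TAR reader then sees plain TAR data.
        GzipCompressorInputStream gzIn = new GzipCompressorInputStream(in);
        TarArchiveInputStream tarIn = new TarArchiveInputStream(gzIn)
    ) {
      TarArchiveEntry tarEntry = tarIn.getNextTarEntry();
      while (tarEntry != null) {
        final Path path = targetDir.resolve(tarEntry.getName()).normalize();
        // Refuse entries such as "../x" that would be written outside targetDir.
        if (!path.startsWith(targetDir)) {
          throw new IOException("Entry outside target directory: " + tarEntry.getName());
        }
        if (tarEntry.isDirectory()) {
          Files.createDirectories(path);
        } else {
          Files.createDirectories(path.getParent());
          // tarIn reports end-of-stream at the end of the current entry.
          try (OutputStream out = Files.newOutputStream(path)) {
            IOUtils.copy(tarIn, out);
          }
        }
        tarEntry = tarIn.getNextTarEntry();
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}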