Tags: python, xml, zip, unzip

Python script unable to process larger zip files from IRS 990 AWS datalake


I am trying to access raw 990 (nonprofit tax returns) XML data through the AWS datalake.

The XML files are organized into Zip files split by the month in which the IRS processed them ("2024_1A" for January, "2024_2A" for February). In months with too many returns for a single Zip file, there are multiple Zip files. See the photo for the Zip files I am iterating through. [Image: Zip files I am trying to process]

I have written a Python script that can access the datalake and process the XML files for 10 out of 14 Zip files. However, for the very large Zips -- 5A, 5B, 11A, 11B -- it returns this error: "Error processing file EfileData/XmlZips/2024_TEOS_XML_05A.zip: That compression method is not supported"

The only differentiating factor between these Zips and the other Zips seems to be their size -- see the fixes/checks I've tried below. This is the code I'm using right now to unzip the files to a temporary directory. Does anyone have thoughts on why it would work for all the other zip files but not 5A, 5B, 11A, 11B, and how I can fix it? Thank you!

import zipfile

with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

Here's some of the fixes I've tried:

  • Used the py7zr library and the 7zip tool through Python.
  • Confirmed the Zips are compression type 9, which corresponds to DEFLATE compression (zipfile.ZIP_DEFLATED), which should be supported by Python's zipfile module.
  • Downloaded the Zip files onto my computer and previewed the .XML files inside to confirm they're not corrupted.
  • Tried unzipping a Zip file on my computer to then upload the contents directly into the Google Colab workspace where I am working, but it was too large (70,000 files in each). It took several hours to download just one of the 4 Zip files, and Google Chrome froze each time I tried to upload it into Google Colab.
  • Attempted to extract the Zip file in batches to be more efficient.

Solution

  • Compression method 9 is not Deflate. (Deflate is method 8.) Method 9 is a proprietary PKWARE enhancement of Deflate called Deflate64. Python's zipfile does not support it. You would need to use an unzip utility, such as Info-ZIP's unzip, 7-Zip, or the like.
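
One way to check which method an archive actually uses is to read each member's `compress_type` from the central directory. The sketch below assumes only the standard library. The helper name `compression_methods` and the `METHOD_NAMES` table are illustrative, not part of zipfile.

```python
import zipfile
from collections import Counter

# A few method numbers from the ZIP format. 8 is Deflate, 9 is Deflate64.
# This table is illustrative and not exhaustive.
METHOD_NAMES = {0: "Stored", 8: "Deflate", 9: "Deflate64", 12: "BZIP2", 14: "LZMA"}


def compression_methods(zip_path):
    """Count the archive's members by compression method name."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        # Listing members only reads the central directory, so this works
        # even when zipfile cannot decompress the member data itself.
        return Counter(
            METHOD_NAMES.get(info.compress_type, f"method {info.compress_type}")
            for info in zf.infolist()
        )
```

Calling `compression_methods(local_zip_path)` on one of the failing archives should report its members under "Deflate64" if the diagnosis above applies.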

    When I try it, zipfile raises an error that says exactly that, which you are also seeing:

    raise NotImplementedError("That compression method is not supported")
    

    R1D3R175 notes in a comment below that there exists a zipfile-deflate64 project that can handle the Deflate64 method.
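
A minimal sketch of how that project might be used, assuming it is installed with `pip install zipfile-deflate64` and that, as its README describes, importing `zipfile_deflate64` patches the standard `zipfile` module. The `extract_all` helper and the `HAVE_DEFLATE64` flag are hypothetical names for illustration. Without the package, the code falls back to plain zipfile behavior.

```python
import zipfile

try:
    # Third-party package. Assumption (from its README): importing it has the
    # side effect of patching zipfile so that method 9 can be decompressed.
    import zipfile_deflate64  # noqa: F401
    HAVE_DEFLATE64 = True
except ImportError:
    HAVE_DEFLATE64 = False


def extract_all(zip_path, dest_dir):
    """Extract every member, with a clearer error for unsupported methods."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        try:
            zf.extractall(dest_dir)
        except NotImplementedError as exc:
            # zipfile raises this when it meets a method it cannot decompress.
            raise RuntimeError(
                f"{zip_path} uses an unsupported compression method; "
                f"zipfile-deflate64 installed: {HAVE_DEFLATE64}"
            ) from exc
```

With the package installed, `extract_all(local_zip_path, temp_dir)` would take the place of the `extractall` call in the question.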

    These zip files were likely made using Windows' built-in compression tool, which elects to use Deflate64 when the sum of the file sizes is greater than 2 GB.
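
The unzip-utility route mentioned above can also be driven from Python. This is a sketch under the assumption that an Info-ZIP `unzip` or a `7z` binary is available on PATH in the working environment. The function names are hypothetical.

```python
import shutil
import subprocess


def build_extract_command(tool, zip_path, dest_dir):
    """Return the argument list for extracting zip_path into dest_dir."""
    if tool == "unzip":
        # Info-ZIP: -q quiet, -o overwrite without prompting, -d target directory
        return ["unzip", "-q", "-o", zip_path, "-d", dest_dir]
    if tool == "7z":
        # 7-Zip: x extracts with full paths, -o<dir> sets the output directory
        # (no space after -o), -y answers yes to all prompts
        return ["7z", "x", zip_path, f"-o{dest_dir}", "-y"]
    raise ValueError(f"unknown tool: {tool}")


def extract_with_external_tool(zip_path, dest_dir):
    """Extract with the first supported tool found on PATH and return its name."""
    for tool in ("unzip", "7z"):
        if shutil.which(tool):
            subprocess.run(build_extract_command(tool, zip_path, dest_dir), check=True)
            return tool
    raise FileNotFoundError("neither unzip nor 7z was found on PATH")
```

Because the tool writes straight to `dest_dir`, the extracted XML files can then be read from disk as before.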