I am trying to access raw 990 (nonprofit tax returns) XML data through the AWS datalake.
The XML files are organized into Zip files split by the month in which the IRS processed them ("2024_1A" for January, "2024_2A" for February). See the photo for the Zip files I am iterating through. For some months there are multiple Zip files because there were too many returns for a single Zip file.
I have written a Python script that can access the data lake and process the XML files for 10 of the 14 Zip files. However, for the very large Zips -- 5A, 5B, 11A, 11B -- it raises this error: "Error processing file EfileData/XmlZips/2024_TEOS_XML_05A.zip: That compression method is not supported"
The only differentiating factor between these Zips and the other Zips seems to be their size -- see the fixes/checks I've tried below. This is the code I'm using right now to unzip the files to a temporary directory. Does anyone have thoughts on why it would work for all the other zip files but not 5A, 5B, 11A, 11B, and how I can fix it? Thank you!
```python
import zipfile

with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)
```
Here are some of the fixes I've tried:
Compression method 9 is not Deflate. (Deflate is method 8.) Method 9 is a proprietary PKWare enhancement of Deflate called Deflate64. Python's zipfile does not support it. You would need to use an unzip utility, such as Info-ZIP's unzip, 7-zip, or the like.
When I try it, zipfile raises an error that says exactly that, which you are also seeing:
```python
raise NotImplementedError("That compression method is not supported")
```
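You can confirm this diagnosis without attempting extraction by inspecting each member's `compress_type`. A minimal sketch: it builds a small in-memory Deflate archive for demonstration, but you would point `ZipFile` at your downloaded zip path instead.

```python
import io
import zipfile

# Build a small in-memory zip with plain Deflate (method 8) to demo the check;
# open your downloaded archive instead to diagnose it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("return.xml", "<Return/>")

# compress_type 8 is Deflate; a Deflate64 archive reports 9 here.
with zipfile.ZipFile(buf) as zf:
    methods = {info.filename: info.compress_type for info in zf.infolist()}

print(methods)  # {'return.xml': 8}
```

If the 5A/5B/11A/11B archives show 9 for their members, Deflate64 is confirmed as the culprit.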
R1D3R175 notes in a comment below that there exists a zipfile-deflate64 project that can handle the Deflate64 method.
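Based on that project's documentation, importing `zipfile_deflate64` patches the standard `zipfile` module to add method-9 support, so your existing extraction code can stay unchanged. A sketch (the import-guard and error message are my additions, not from the project):

```python
import zipfile

try:
    # Importing zipfile_deflate64 (pip install zipfile-deflate64) patches the
    # standard zipfile module to support Deflate64 (compression method 9).
    import zipfile_deflate64  # noqa: F401
except ImportError:
    pass  # plain zipfile still handles Stored (0) and Deflate (8) archives

def extract_zip(local_zip_path, temp_dir):
    """Extract an archive, surfacing a clearer error if Deflate64 is unsupported."""
    try:
        with zipfile.ZipFile(local_zip_path) as zip_ref:
            zip_ref.extractall(temp_dir)
    except NotImplementedError as exc:
        raise RuntimeError(
            f"{local_zip_path} likely uses Deflate64; "
            "install zipfile-deflate64 or extract with 7-Zip/unzip"
        ) from exc
```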
These zip files were likely made using Windows' built-in compression tool, which elects to use Deflate64 when the sum of the file sizes is greater than 2 GB.
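If installing a package isn't an option, shelling out to one of the external tools mentioned above also works, since both Info-ZIP's unzip and 7-Zip handle Deflate64. A sketch assuming one of them is on PATH:

```python
import shutil
import subprocess

def extract_with_external_tool(local_zip_path, temp_dir):
    """Extract via Info-ZIP's unzip or 7-Zip; returns the tool used."""
    if shutil.which("unzip"):
        # -o: overwrite without prompting, -q: quiet, -d: destination directory
        subprocess.run(
            ["unzip", "-o", "-q", local_zip_path, "-d", temp_dir], check=True
        )
        return "unzip"
    if shutil.which("7z"):
        # x: extract with paths, -y: assume yes, -o<dir>: destination (no space)
        subprocess.run(["7z", "x", "-y", f"-o{temp_dir}", local_zip_path], check=True)
        return "7z"
    raise FileNotFoundError("Neither unzip nor 7z found on PATH")
```

Note that `-o{temp_dir}` for 7z takes no space between the flag and the directory.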