How does the Python interpreter detect that it was called with a ZIP archive instead of a source file?


I just found out that (A) a ZIP file can be passed directly to the Python binary as the script parameter (where normally a .py file would be passed) and (B) the ZIP file can have any suffix, even .py, and still be recognized as a ZIP file (this seems to work at least on Mac OS X from the command line and on Windows from both the command line and the GUI). The whole story of implementing this is documented in this issue.

This seems very appealing for distributing Python applications where an installer is undesirable, because it has the same usage characteristics as a .jar archive (no installation required, can be sent by email without further archiving), to which our users are accustomed. Naming the ZIP archive .py (or .pyw) enables this behavior without any configuration on the client machine aside from installing Python.

My problem is that I can only find documentation of part (A) of my findings but not for part (B). So my first question is how is Python detecting that a file passed as the script parameter is a ZIP archive and not a Python source file? Are there any heuristics involved that may break randomly e.g. when the ZIP archive contains some special content (e.g. an uncompressed file that looks like Python code)?

The second question is whether there are any drawbacks to this approach when the application is carrying around a lot of non-code data files (tens of MBs), aside from the fact that access to these files is not transparent. I'm thinking about the ZIP file detection taking longer if the ZIP file is large and/or contains a lot of files.

Update

All answers up to now (Joachim Sauer's, Keith Randall's and Curious's) are sadly wrong. The Zip specification does not mandate that a ZIP file must start with a specific header. A Zip file can have any data prepended to it and still be a valid Zip file (this is how self-extracting Zip files work: the file starts with a Windows EXE header and not anything Zip-specific). This is explained in the page linked in Curious's answer.

I'm guessing that the Python interpreter looks for the Zip central directory and if there is one, the file is used as a Zip file instead of a Python source file. Does anyone want to include this in his/her answer so I can accept it?


Solution

  • I wondered the same and found:

    You are correct that prepended data is allowed, and this is explicitly mentioned in the docs:

    Python has been able to execute zip files which contain a __main__.py file since version 2.6. In order to be executed by Python, an application archive simply has to be a standard zip file containing a __main__.py file [...]

    The zip file format allows arbitrary data to be prepended to a zip file.
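The quoted passage can be tried out with the standard-library `zipapp` module, which writes a shebang line in front of the ZIP data. This is only a minimal sketch: the directory name `app_src`, the output name `app.pyz`, and the helper `build_app` are made up for illustration.

```python
import tempfile
import zipapp
import zipfile
from pathlib import Path


def build_app(workdir: Path) -> Path:
    """Pack a one-file application into an archive with a prepended shebang line."""
    src = workdir / "app_src"  # hypothetical source directory
    src.mkdir()
    (src / "__main__.py").write_text('print("hi")\n')

    target = workdir / "app.pyz"  # hypothetical output name
    # Passing an interpreter makes zipapp write "#!<interpreter>" plus a newline
    # before the ZIP data, i.e. the archive starts with non-ZIP bytes.
    zipapp.create_archive(src, target, interpreter="/usr/bin/env python3")
    return target


with tempfile.TemporaryDirectory() as tmp:
    archive = build_app(Path(tmp))
    # The file starts with plain text, not with a ZIP signature ...
    first_line = archive.read_bytes().splitlines()[0]
    # ... yet the zipfile module still accepts it as an archive.
    still_a_zip = zipfile.is_zipfile(archive)
    interpreter = zipapp.get_interpreter(archive)
```

Here `first_line` holds the shebang while `still_a_zip` stays true, which is the "arbitrary data prepended" case in its simplest form.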

    You are also correct in guessing that Python looks for a ZIP central directory. This happens in zipimport.py, which searches near the end of the file for the end-of-central-directory signature, STRING_END_ARCHIVE = b'PK\x05\x06'.

    The contents of the archive, such as uncompressed Python code files, therefore do not affect the detection of the zip file.
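To make the idea concrete, here is a simplified sketch of a tail search for that signature. It is not the code from zipimport.py: the function `looks_like_zip`, the helper `make_samples`, and the constant names are assumptions for illustration, and the real importer goes on to parse and validate the central directory.

```python
import io
import os
import tempfile
import zipfile

END_OF_CENTRAL_DIR = b"PK\x05\x06"  # same bytes as STRING_END_ARCHIVE
EOCD_MIN_SIZE = 22                   # fixed-size part of the record
MAX_COMMENT = 0xFFFF                 # longest possible trailing archive comment


def looks_like_zip(path: str) -> bool:
    """Return True if an end-of-central-directory signature sits in the file's tail."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Only the last ~64 KiB can contain the record, so read just that much.
        tail_len = min(size, EOCD_MIN_SIZE + MAX_COMMENT)
        f.seek(size - tail_len)
        tail = f.read(tail_len)
    pos = tail.rfind(END_OF_CENTRAL_DIR)
    # The signature must be followed by at least the fixed part of the record.
    return pos != -1 and len(tail) - pos >= EOCD_MIN_SIZE


def make_samples(workdir: str) -> tuple[str, str]:
    """Create a plain source file and a source file with a ZIP archive appended."""
    plain = os.path.join(workdir, "plain.py")
    with open(plain, "wb") as f:
        f.write(b'print("hello")\n')

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("__main__.py", 'print("hi")\n')

    hybrid = os.path.join(workdir, "hybrid.py")
    with open(hybrid, "wb") as f:
        f.write(b'print("hello")\n' + buf.getvalue())
    return plain, hybrid


with tempfile.TemporaryDirectory() as tmp:
    plain_path, hybrid_path = make_samples(tmp)
    plain_is_zip = looks_like_zip(plain_path)
    hybrid_is_zip = looks_like_zip(hybrid_path)
```

In this sketch the search reads at most about 64 KiB from the end of the file, so the signature check itself should not depend on how much data the archive carries. Reading the central directory afterwards presumably scales with the number of entries rather than with the size of the stored files.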

    A demonstration:

    $ echo 'print("hello")' > script.py
    $ python script.py
    hello
    $ echo 'print("hi")' > __main__.py
    $ zip app.zip __main__.py
      adding: __main__.py (stored 0%)
    $ dd if=app.zip >> script.py
    0+1 records in
    0+1 records out
    184 bytes transferred in 0.000066 secs (2786108 bytes/sec)
    $ zip -A script.py
    Zip entry offsets appear off by 15 bytes - correcting...
    $ head -n 1 script.py 
    print("hello")
    $ unzip -l script.py 
    Archive:  script.py
      Length      Date    Time    Name
    ---------  ---------- -----   ----
           12  08-04-2022 23:02   __main__.py
    ---------                     -------
           12                     1 file
    $ python script.py 
    hi
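The same experiment can be reproduced without the `zip` and `dd` tools. The sketch below assumes that `runpy.run_path` treats the file the way the interpreter does for a script argument, and the helper names `build_hybrid` and `run_and_capture` are made up for illustration.

```python
import contextlib
import io
import os
import runpy
import tempfile
import zipfile


def build_hybrid(path: str) -> None:
    """Write Python source first, then append a ZIP archive holding __main__.py."""
    with open(path, "wb") as f:
        f.write(b'print("hello")\n')
    # Opening a non-ZIP file in mode "a" appends a new archive to the existing
    # content; entry offsets are recorded relative to the start of the file,
    # so no separate "zip -A" style fix-up is needed here.
    with zipfile.ZipFile(path, "a") as zf:
        zf.writestr("__main__.py", 'print("hi")\n')


def run_and_capture(path: str) -> str:
    """Run the file in-process and return whatever it printed."""
    out = io.StringIO()
    with contextlib.redirect_stdout(out):
        runpy.run_path(path, run_name="__main__")
    return out.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "script.py")
    build_hybrid(script)
    with open(script, "rb") as f:
        first_line = f.readline()  # still the original source line
    output = run_and_capture(script)  # comes from __main__.py inside the archive
```

As in the shell session, the first line of the file is still ordinary Python source, but the code that runs is the `__main__.py` stored in the appended archive.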