Check Parquet File Magic Number in Python


In Python, we can validate a zip file with the zipfile.is_zipfile function: https://docs.python.org/2/library/zipfile.html

Similarly, I want to validate a third-party Parquet file based on its magic number before I consume it. Is there an API I can use to validate a Parquet file based on its magic number, and could it be a security risk if I don't validate it?


Solution

  • For many file formats, the magic number identifying the file type is stored in the first few bytes of the file. The same is true for Parquet, whose magic number is four bytes long, but Parquet also writes the magic bytes at the end of the file, so you can check either location (or both). The magic string at both locations is "PAR1".

    You can do this manually, but if you are using pyarrow, the validation of Parquet files automatically happens behind the scenes. You can check this with a simple experiment. First, try to load an actual Parquet file:

    >>> import pyarrow.parquet as pq
    >>> parquet_file = pq.ParquetFile('data.parquet')
    

    This operation succeeds and you can use parquet_file in any way you want, for example access its metadata as parquet_file.metadata. On the other hand, if you try to open a non-Parquet file, you get an error:

    >>> parquet_file = pq.ParquetFile('/etc/crontab')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/zi/.local/lib/python2.7/site-packages/pyarrow/parquet.py", line 128, in __init__
        self.reader.open(source, use_memory_map=memory_map, metadata=metadata)
      File "pyarrow/_parquet.pyx", line 640, in pyarrow._parquet.ParquetReader.open
      File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
    pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.
    

    Regarding the second part of your question: skipping the magic number check is not a security risk in itself, because attackers who can forge a malicious file intended to trigger some vulnerability can just as easily put the correct magic string in it. The check is more a question of how early you recognize that something is wrong with the file and how useful the resulting error message is.

    For example, if a piece of code skips the magic bytes check, immediately reads the offset of the footer, and then tries to read the footer from that offset, you may end up with a not-so-useful error message complaining about an invalid offset instead of a much more useful one complaining about the wrong file type.
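
The manual check mentioned above can be sketched with the standard library alone. This is a minimal sketch, not pyarrow's own logic: the function name has_parquet_magic and the constant PARQUET_MAGIC are hypothetical names chosen for illustration, and the check only compares the first and last four bytes of the file against "PAR1".

```python
import os

# 4-byte marker that Parquet writes at both the start and the end of a file
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file at `path` starts and ends with b"PAR1".

    Hypothetical helper for illustration; it does not parse the footer.
    """
    # Smallest plausible file: leading magic + 4-byte footer length + trailing magic
    min_size = 2 * len(PARQUET_MAGIC) + 4
    if os.path.getsize(path) < min_size:
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        # Jump to the last four bytes of the file
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A True result only means the file looks like Parquet at first glance. As noted above, a forged file can carry the correct magic string, so this is an early sanity check that produces a clearer error, not a security measure; the actual parsing is still best left to a library such as pyarrow.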