Search code examples
python-3.xapache-nififile-extension

Get the actual file extension using NiFi


I am using Apache NiFi to ingest data from Azure Storage. Now, the file I want to a huge file (100+ GB) read can have any extension and I want to read the file's header to get its actual extension.

I found python-magic package which uses libmagic to read the file's header to fetch the extension, but this requires the file to be present locally.

The NiFi pipeline to ingest the data looks like this enter image description here

I need a way to get the file extension in this NiFi pipeline. Is there a way to read the file's header from the Content Repo? If yes, how do we do it? FlowFile has only the metadata which says the content-type as text/plain for a CSV.


Solution

  • There is no such thing as a generic 'header' that all files have that gives you it's "real" extension. A file is just a collection of bits, and we sometimes choose to give extensions/headers/footers/etc so that we know how to interpret those bits.

    We tend to add that 'type' information in two ways, via a file extension e.g. .mp4 and/or via some metadata that accompanies the file - this is sometimes a header, which is sometimes plaintext and easily readible, but this is not always true. Additioanlly, it is up to the user and/or the application to set this information, and up the user and/or application to read it - neither of which are a given.

    If you do not trust that the file has the proper extension applied (e.g. video.txt when it's actually an mp4) then you could also try to interrogate the metadata that is held in Azure Blob Storage (ContentType) and see what that says - however, this is also up to the user/application to set when the file is uploaded to ABS, so there is no guarantee that it is any more accurate than the file extension.

    text/plain is not invalid for a plaintext CSV, as CSVs are just formatted plaintext - similar to JSON. However, you can be more specific and use e.g. text/csv for CSV and application/json for JSON.

    NiFi does have IndentifyMimeType which can try to work it out for you by interrogating the file, but it is more complex that just accessing some 'header'. This processor uses Apache Tika for the detection, and adds a mime.type attribute to the FlowFile.

    If your file is some kind of custom format, then this processor likely won't help you. If you know your files have a specific header, then you'll need to provide more information for your exact situation.