Why is ExifTool slow and memory-hungry when reading from stdin, but fast and small when reading from disk?

I'm calling exiftool to extract XMP tags such as Description from large videos, 5GB and over. My application is written in Python, and I have seen some files exhaust memory. I invoke exiftool like this:

import subprocess

CMD = 'exiftool -api largefilesupport=1 -sort -a -S -G -struct -j -'
# Open the video in binary mode and hand the file object to exiftool as its stdin.
with open('9502_UAS_2.mov', 'rb') as fp:
    exiftool = subprocess.Popen(CMD.split(),
                                stdin=fp,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
    (json_bytes, stderr) = exiftool.communicate()

To isolate the problem, I've tried variants on the CLI. This shows that reading from a file on disk is fast and uses little RAM, while reading from STDIN (re-creating the file pointer read above) is very slow and uses a lot of RAM (I've removed the output JSON metadata below for clarity):

time exiftool -api largefilesupport=1 -sort -a -S -G -struct -j 9502_UAS.mov
real    0m0.196s

time cat 9502_UAS.mov | exiftool -api largefilesupport=1 -sort -a -S -G -struct -j -
real    0m33.514s

'top' showed the second one consumed up to 1.4GB RAM on this 5.1GB video file.

I'd like to understand why reading from STDIN is slow and consumes so much memory, so I can watch out for limits like memory exhaustion on my servers. Is exiftool reading sequentially through the entire STDIN stream, buffering the file until it reaches the binary data it needs to parse the metadata? Is it not seek()-ing back and forth to find what it needs?

Conversely, why is running it against a native disk file so quick? Is exiftool using a memory mapped filesystem to quickly jump to the sections of the file it needs to parse?

Ideally, I'd read from STDIN because the real application's file origin is an AWS S3 bucket and I don't want to copy the file to local AWS EC2 disk if I can avoid it, so any hints to make reading stdin efficient would help.

Thanks.

Solution

Well, you are passing the whole content to stdin in the example. This takes time, of course. It would be better to pass the file name to the external tool:

import subprocess

# Append the file name as its own list item so a name containing spaces stays one argument.
CMD = 'exiftool -api largefilesupport=1 -sort -a -S -G -struct -j'
exiftool = subprocess.Popen(CMD.split() + ['9502_UAS_2.mov'],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
json_bytes, stderr = exiftool.communicate()

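When the file is passed via stdin, the whole content is first piped to the program, and the process only stops once that transfer has finished, regardless of whether the tool has already done its job. A pipe also cannot be seeked, so the tool cannot jump straight to the metadata the way it can in a file on disk; it has to read, and possibly buffer, everything that comes before it.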
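The seek() point raised in the question can be checked directly. The following is a minimal sketch, assuming a POSIX system, that shows a regular file descriptor accepts a seek while the read end of a pipe does not. The helper names `is_seekable` and `demo` are made up for illustration, and the sketch says nothing about how exiftool is implemented internally; it only shows the constraint any reader of a pipe faces.

```python
import os
import tempfile


def is_seekable(fd):
    """Return True if the OS allows seeking on this file descriptor."""
    try:
        # A no-op seek: ask for the current position without moving.
        os.lseek(fd, 0, os.SEEK_CUR)
        return True
    except OSError:
        # On POSIX, pipes fail here with ESPIPE ("Illegal seek").
        return False


def demo():
    """Return (file_is_seekable, pipe_is_seekable)."""
    read_end, write_end = os.pipe()
    try:
        with tempfile.TemporaryFile() as disk_file:
            return is_seekable(disk_file.fileno()), is_seekable(read_end)
    finally:
        os.close(read_end)
        os.close(write_end)
```

A program that receives its input through `cat file |` sees only the pipe case, so any data located far into the file can be reached only by reading through everything before it.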

When the file is on a remote server, you need to either run this script on that server, copy the file to local disk, or read the first n bytes of the file and pass only those to exiftool. (Determining how large n has to be is left as an exercise...)
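The "first n bytes" option could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: `run_on_prefix` is a hypothetical helper, and it only helps if the metadata actually lies within the first n bytes. In QuickTime/MOV files the `moov` atom can sit at the end of the file, in which case a prefix will not contain it.

```python
import subprocess


def run_on_prefix(path, n_bytes, cmd):
    """Feed only the first n_bytes of `path` to `cmd` on stdin.

    Returns (stdout, stderr) as bytes. `cmd` is an argument list,
    e.g. an exiftool invocation ending in '-' to read from stdin.
    """
    with open(path, 'rb') as fp:
        prefix = fp.read(n_bytes)  # at most n_bytes are held in memory
    proc = subprocess.run(cmd,
                          input=prefix,
                          stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE)
    return proc.stdout, proc.stderr
```

With the flags from the question, a call might look like `run_on_prefix('9502_UAS_2.mov', 64 * 1024 * 1024, ['exiftool', '-a', '-S', '-G', '-struct', '-j', '-'])`, where the 64 MiB figure is an arbitrary guess rather than a recommended value. For a remote source, the `open`/`read` step would be replaced by whatever ranged read the storage service offers.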