
Python mmap.mmap() to bytes-like object?


The documentation for mmap says that "Memory-mapped file objects behave like both bytearray and like file objects."

However, that doesn't seem to extend to a standard for loop. At least on Python 3.8.5 on Linux, which I'm currently using, each element produced by iterating over an mmap.mmap() is a single-byte bytes, while for both a bytearray and normal file access each element is an int instead. Update (correction): for normal file access each element is a variable-sized bytes; see below.

Why is that? And more importantly, how can I efficiently get a bytes-like object from an mmap, that is, one where not only indexing but also a for loop gives me an int? (By efficiently, I mean that I'd like to avoid additional copying, casting, etc.)


Here is code to demonstrate the behavior:

#!/usr/bin/env python3.8

def print_types(desc, x):
    for el in setmm: break   ### UPDATE: bug here, `setmm` should be `x`, see comments
    # `el` is now the first element of `x`
    print('%-30s: type is %-30s, first element is %s' % (desc,type(x),type(el)))
    try: print('%72s(first element size is %d)' % (' ', len(el)))
    except: pass # ignore failure if `el` doesn't support `len()`

setmm = bytearray(b'hoi!')
print_types('bytearray', setmm)

with open('set.mm', 'rb') as f:
    print_types('file object', f)

with open('set.mm', 'rb') as f:
    setmm = f.read()
    print_types('file open().read() result', setmm)

import mmap
with open('set.mm', 'rb') as f:
    setmm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    print_types('file mmap.mmap() result', setmm)

which results in

bytearray                     : type is <class 'bytearray'>           , first element type is <class 'int'>
file object                   : type is <class '_io.BufferedReader'>  , first element type is <class 'int'>
file open().read() result     : type is <class 'bytes'>               , first element type is <class 'int'>
file mmap.mmap() result       : type is <class 'mmap.mmap'>           , first element type is <class 'bytes'>
                                                                        (first element size is 1)

Update. With the bug fixed that furas kindly pointed out in the comments, the result becomes

bytearray                     : type is <class 'bytearray'>           , first element is <class 'int'>
file object                   : type is <class '_io.BufferedReader'>  , first element is <class 'bytes'>
                                                                        (first element size is 38)
file open().read() result     : type is <class 'bytes'>               , first element is <class 'int'>
file mmap.mmap() result       : type is <class 'mmap.mmap'>           , first element is <class 'bytes'>
                                                                        (first element size is 1)

This answers what happens: for some reason, iterating over an mmap is like iterating over a file in that it returns a bytes every time, but instead of full lines as for a file, it returns single-byte chunks.

Still my main question is unchanged: How can I efficiently have an mmap behave like a bytes-like object (i.e., both indexing and for give int)?


Solution

  • How can I efficiently have an mmap behave like a bytes-like object (i.e., both indexing and for give int)?

    bytes is an object which contains data in memory. But the whole point of mmap is to not load all the data into memory.

    If you want to get a bytes object containing the entire content of a file, open() the file as normal and read() the entire content. Using mmap() for this is working against yourself.

    Perhaps you want to use memoryview, which can be constructed from bytes or mmap() and will give you a uniform API.
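A minimal sketch of the memoryview suggestion, assuming a small throwaway file stands in for the real one. The helper name `ints_from_mmap` and the temporary file are made up for illustration, and `access=mmap.ACCESS_READ` is used here in place of the question's `prot=mmap.PROT_READ` only because it also works outside Unix. Wrapping the mapping in a memoryview does not copy the data, and both indexing and iteration over the view yield int:

```python
import mmap
import os
import tempfile


def ints_from_mmap(path):
    """Map `path` read-only and return (first element by index, all elements by iteration)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The view shares the mapped memory; release it (via `with`)
            # before the mmap itself is closed.
            with memoryview(mm) as view:
                first = view[0]       # indexing gives an int
                values = list(view)   # iteration gives ints as well
    return first, values


# Hypothetical demo file; the content mirrors the question's b'hoi!' example.
tmp = tempfile.NamedTemporaryFile(delete=False)
try:
    tmp.write(b"hoi!")
    tmp.close()
    first, values = ints_from_mmap(tmp.name)
finally:
    os.unlink(tmp.name)

print(type(first), values)
```

The `list(view)` call is only there to make the element types visible; in real code you would loop over `view` directly so that nothing beyond the pages you touch is pulled into memory.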