Python: 'Unmask' a long XOR'd string of data

As part of the WebSocket spec, all client-sent frames must have the payload part of the frames masked using a 4-byte mask. In C++, this would be really easy :

for (size_t i = 0; i < length; i++) {
    data[i] ^= mask[i % 4];
}

Sadly, Python strings are immutable and I'd rather avoid having to do something like this, because of the constant copying and re-creation of the string buffer :

frame = ''
for i in range(0, length-1):
    frame += chr(ord(oldFrame[i]) ^ ord(mask[i % 4]))

So after some research, I found this one :

m = itertools.cycle(mask)
frame = ''.join(chr(ord(x) ^ ord(y)) for (x,y) in itertools.izip(oldFrame, m))

In CPython, that halved the time needed for the unmasking. In PyPy, for a 16MB masked string, this easily grows to 1.5GB RAM usage, after which it starts swapping and I have to kill it. CPython uses 'only' 150 MB RAM for this 16MB string (and takes 20 seconds), but that's still bad. In comparison, my C++ benchmark did it in 0.05 seconds with no memory overhead.

Of course these are extremes that won't actually happen once the software goes in production mode (all incoming data is capped at 10KB), but I would really like to have a good score on this benchmark.

Any ideas? The only requirements are: fast on both CPython and PyPy, and low memory usage. The original string doesn't have to be preserved.

Some test code for those who want to experiment:

import os, time

frame = os.urandom(16 << 20)
mask = os.urandom(4)

def unmask(oldFrame, mask):
    # Do your magic
    return newFrame

for i in range(0, 3): # Run several times, to help PyPy's JIT compiler
    startTime = time.time()
    f = unmask(frame, mask)
    endTime = time.time()
    print 'This run took %.3f seconds' % (endTime - startTime)

Solution

Use a bytearray instead, which is mutable.

frame = bytearray(frame)
for i in range(len(mask)):
    frame[i] ^= mask[i]