Is reinterpret_cast bad when dealing with low-level byte manipulation?

I'm writing a websocket server and I have to deal with masked data that I need to unmask.

The mask is unsigned char[4], and the data is an unsigned char* buffer as well.

I don't want to XOR byte by byte, I'd much rather XOR 4 bytes at a time.

uint32_t * const end = reinterpret_cast<uint32_t *>(data_+length);
for(uint32_t *i = reinterpret_cast<uint32_t *>(data_); i != end; ++i) {
    *i ^= mask_;
}

Is there anything wrong with the use of reinterpret_cast in this situation?

The alternative would be the following code, which is neither as clear nor as fast:

uint64_t j = 0;
uint8_t *end = data_+length;
for(uint8_t *i = data_; i != end; ++i,++j) {
    *i ^= mask_[j % 4];
}

I'm all ears for alternatives, including ones dependent on c++11 features.

Solution

There are a few potential problems with the approach posted:

On some systems, objects of a type bigger than char need to be aligned properly to be accessible. A typical requirement for uint32_t is that the object is aligned to an address divisible by four.
If length % sizeof(uint32_t) != 0 the loop may never terminate, because i steps past end without ever comparing equal to it.
Depending on the endianness of the system, mask needs to contain different values. If mask is produced by *reinterpret_cast<uint32_t*>(char_mask) of a suitable array, this shouldn't be an issue.
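One way to keep the 4-bytes-at-a-time idea while sidestepping all three points is to move the words through std::memcpy instead of casting the pointer. This is only a sketch under assumptions: the free function unmask and its parameters are hypothetical names, not part of the code in the question, and whether it beats the byte-wise loop is something to measure on your platform.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: XORs `length` bytes at `data` with the repeating
// 4-byte mask that `mask` points to, one 32-bit word at a time where possible.
void unmask(unsigned char* data, std::size_t length, const unsigned char* mask) {
    // Load the mask with memcpy so it has the same byte order as the words
    // loaded from the buffer; the result is then independent of endianness.
    std::uint32_t mask_word;
    std::memcpy(&mask_word, mask, sizeof mask_word);

    std::size_t i = 0;
    // Whole words: memcpy has no alignment requirement on `data + i`.
    for (; i + sizeof mask_word <= length; i += sizeof mask_word) {
        std::uint32_t word;
        std::memcpy(&word, data + i, sizeof word);
        word ^= mask_word;
        std::memcpy(data + i, &word, sizeof word);
    }
    // Trailing 0 to 3 bytes: i is a multiple of 4 here, so i % 4
    // continues the mask at the right position.
    for (; i < length; ++i) {
        data[i] ^= mask[i % 4];
    }
}
```

Because the loop condition is i + 4 <= length rather than a pointer inequality, it terminates for any length, and the tail loop handles whatever is left.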

If these issues are taken care of, reinterpret_cast<...>(...) can be used in the situation you have. Reinterpreting the meaning of pointers is one of the reasons this operation is there and sometimes it is needed. I would create a suitable test case to verify that it works properly, though, to avoid having to hunt down problems when porting the code to a different platform.
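A test case along those lines could compare the fast routine against the plain byte-wise loop for many lengths and starting offsets. The following is a minimal sketch, assuming the routine under test takes (data, length, mask); the names unmask_reference and matches_reference are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Reference implementation: the obviously correct byte-by-byte XOR.
inline void unmask_reference(unsigned char* data, std::size_t length,
                             const unsigned char* mask) {
    for (std::size_t i = 0; i < length; ++i) {
        data[i] ^= mask[i % 4];
    }
}

// Returns true if `candidate` produces the same bytes as the reference for
// every tested length and starting offset. Non-zero offsets exercise
// buffers that are not aligned to a 4-byte boundary.
template <typename Unmask>
bool matches_reference(Unmask candidate) {
    const unsigned char mask[4] = {0x12, 0x34, 0x56, 0x78};
    for (std::size_t offset = 0; offset < 8; ++offset) {
        for (std::size_t length = 0; length <= 64; ++length) {
            // Fill the buffer with a simple deterministic pattern.
            std::vector<unsigned char> expected(offset + length);
            for (std::size_t i = 0; i < expected.size(); ++i) {
                expected[i] = static_cast<unsigned char>(i * 31 + 7);
            }
            std::vector<unsigned char> actual(expected);

            unmask_reference(expected.data() + offset, length, mask);
            candidate(actual.data() + offset, length, mask);
            if (expected != actual) {
                return false;
            }
        }
    }
    return true;
}
```

Passing your own word-wise routine to matches_reference on each target platform checks the result for lengths that are not multiples of four and for misaligned starts. Note that on hardware that tolerates unaligned access it will not reveal an alignment problem that only shows up elsewhere.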

Personally I would go with a different approach until profiling shows that it is too slow:

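unsigned char* it(data_);
unsigned char* const last(data_ + length);
if (4 < length) {
    // Unmask whole 4-byte groups, leaving 1 to 4 bytes for the tail below.
    for (unsigned char* end(last - 4); it < end; it += 4) {
        it[0] ^= mask_[0];
        it[1] ^= mask_[1];
        it[2] ^= mask_[2];
        it[3] ^= mask_[3];
    }
}
// Unmask the remaining bytes (at most four). Plain if statements are used
// because `a && *it++ ^= b` parses as `(a && *it++) ^= b` and does not compile.
if (it != last) *it++ ^= mask_[0];
if (it != last) *it++ ^= mask_[1];
if (it != last) *it++ ^= mask_[2];
if (it != last) *it++ ^= mask_[3];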

I'm definitely using a number of similar approaches in software which is meant to be really fast, and I haven't found them to be a notable performance problem.