It seems fetch_add is a win over a CAS loop on CPUs that support both (see the post comments as well).
When changing bit(s) that are known to be clear into set bit(s), you can use either a bitwise OR or an addition. As long as the bits really are clear beforehand, the results will be identical, and I expect the performance of each to be equal. So the decision on which operation to use would hinge on differences in hardware support for the two operations, if there are any. (I failed to turn up any information on relative processor support.)
Is there any reason to prefer one over the other in this case?
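To make the comparison concrete, here is a minimal sketch using std::atomic. The helper names set_with_or and set_with_add are hypothetical, chosen only for illustration; fetch_or and fetch_add are the standard member functions.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helpers for illustration. Both return the value held
// before the update, as fetch_or / fetch_add do.

inline std::uint32_t set_with_or(std::atomic<std::uint32_t>& word,
                                 std::uint32_t mask) {
    // OR is idempotent: bits that are already set simply stay set.
    return word.fetch_or(mask);
}

inline std::uint32_t set_with_add(std::atomic<std::uint32_t>& word,
                                  std::uint32_t mask) {
    // ADD matches OR only if every bit in mask is currently clear;
    // otherwise the addition carries into higher bits.
    return word.fetch_add(mask);
}
```

The two agree only while the precondition holds. If a bit in the mask is already set, the OR leaves the word unchanged while the addition carries into the next bit, so the addition variant is only safe when something else guarantees each bit is set at most once.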
What you might want to do, instead of coding for a specific processor architecture, is to use a compiler intrinsic. GCC and Clang, for example, support several atomic builtins, one of which is __sync_fetch_and_or.
Since Visual Studio 2005, Visual C++ has supported _InterlockedOr on all architectures.
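As a rough sketch of how the two intrinsics could sit behind one function: the wrapper name atomic_fetch_or_long is hypothetical, and the compiler check assumes that anything that is not MSVC provides the GCC-style builtin, which will not hold for every toolchain.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>  // declares _InterlockedOr
#endif

// Hypothetical wrapper: atomically ORs mask into *target and returns
// the value *target held before the operation.
inline long atomic_fetch_or_long(volatile long* target, long mask) {
#if defined(_MSC_VER)
    return _InterlockedOr(target, mask);
#else
    // Assumption: GCC, Clang, or another compiler with __sync builtins.
    return __sync_fetch_and_or(target, mask);
#endif
}
```

Both intrinsics return the previous value, so a caller can tell whether the bits were already set by testing the result against the mask.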