Faster test_and_clear_bit

I'm running the code below in a Linux kernel module on x86_64. Basically, I iterate over 256 bits, and for each bit that is set to 1, I clear it and perform a certain action. The code below, however, takes a few thousand cycles to run (and the bottleneck is not the code executed within the body of the if statement).

unsigned long *data = ...;
for (int i = 0; i < 256; i++) {
    //test_and_clear_bit is a function defined in the Linux kernel
    if (test_and_clear_bit(i, data)) {
        //bit i was set, so do something
    }
}

The bottleneck seems to be the test_and_clear_bit function. The data I'm iterating over is a hardware-defined data structure that I may only mutate with read-modify-write instructions (according to the Intel manual). This is because the processor may attempt to mutate the data structure at the same time. Thus, I can't fall back on a simple solution such as protecting the data structure with a single spinlock and then just reading and clearing the bits with non-atomic instructions.

Is there a faster way to do this?

Solution

This is a difficult question to answer because we don't know exactly what this data is, and because of this statement:

The data I'm iterating over is a hardware-defined data structure that I may only mutate with read-modify-write instructions (according to the Intel manual).

That said, the best we can do is general ideas/recommendations, which may or not apply to your specific situation. You can try the following:

Copy data into a local buffer and iterate over the bits, only calling test_and_clear_bit if the bit was set in the local buffer. This will avoid calling test_and_clear_bit on bits which are not already set on the local buffer. Obviously bits which are not set in the local buffer might have been set between the time of the structure's copy and execution, but if that is an acceptable loss, this will likely speed up the loop significantly.
Test multiple bits at a time, if possible. As @immibis mentions in a comment, if you can check 8, 16, 32, or 64 bits at a time, then only test individual bits if you get a response from a multi-bit set. If it's likely that at least one bit in every 8 or more is set, then this won't work and will actually slow down the loop as it adds an extra unnecessary call.
Attempt your own test_and_clear_bit implementation with volatile, as @IlyaBursov mentions in a comment. This is not guaranteed to work, and what might work on one platform or compiler might not an another. However, if you're using a hardware-defined memory structure, a platform-specific solution might work for you. Note that volatile is unlikely to guard against this processor modifying the bits, but on some platforms (if you are lucky, yours) it very well might be. As mentioned here:

As a result, most implementations do not insert sufficient memory fences to guarantee that other threads, or even hardware devices, see volatile operations in the order in which they were issued

On some platforms, some limited ordering guarantees are provided, either because they are automatically enforced by the underlying hardware or, as on Itanium, because different instructions are generated for volatile references. But the specific rules are highly platform dependent. And even when they are specified for a specific platform, they may be inconsistently implemented.