I'm running the code below in a Linux kernel module on x86_64. Basically, I iterate over 256 bits, and for each bit that is set to 1, I clear it and perform a certain action. The code below, however, takes a few thousand cycles to run (and the bottleneck is not the code executed within the body of the if statement).
unsigned long *data = ...;
for (int i = 0; i < 256; i++) {
    // test_and_clear_bit is an atomic helper defined by the Linux kernel
    if (test_and_clear_bit(i, data)) {
        // bit i was set, so do something
    }
}
The bottleneck seems to be the test_and_clear_bit function. The data I'm iterating over is a hardware-defined data structure that I may only mutate with read-modify-write instructions (according to the Intel manual). This is because the processor may attempt to mutate the data structure at the same time. Thus, I can't fall back on a simple solution such as protecting the data structure with a single spinlock and then just reading and clearing the bits with non-atomic instructions.
Is there a faster way to do this?
This is a difficult question to answer because we don't know exactly what this data is, and because of this statement:
The data I'm iterating over is a hardware-defined data structure that I may only mutate with read-modify-write instructions (according to the Intel manual).
That said, the best we can do is offer general ideas/recommendations, which may or may not apply to your specific situation. You can try the following:
- Copy data into a local buffer and iterate over the bits, only calling test_and_clear_bit if the bit was set in the local buffer. This avoids calling test_and_clear_bit on bits which are not already set in the local copy. Obviously, bits which are not set in the local copy might have been set between the time of the copy and the iteration, but if that is an acceptable loss, this will likely speed up the loop significantly.
- Replace the test_and_clear_bit implementation with volatile, as @IlyaBursov mentions in a comment. This is not guaranteed to work, and what works on one platform or compiler might not on another. However, since you're using a hardware-defined memory structure, a platform-specific solution might work for you. Note that volatile is unlikely to guard against the processor modifying the bits, but on some platforms (if you are lucky, yours) it very well might. As mentioned here:
As a result, most implementations do not insert sufficient memory fences to guarantee that other threads, or even hardware devices, see volatile operations in the order in which they were issued
On some platforms, some limited ordering guarantees are provided, either because they are automatically enforced by the underlying hardware or, as on Itanium, because different instructions are generated for volatile references. But the specific rules are highly platform dependent. And even when they are specified for a specific platform, they may be inconsistently implemented.