multithreading, linux-kernel, cpu, cpu-cache, smp

I have a CPU cache-coherency-looking problem that I can't figure out how to fix: two CPUs see different contents of the same memory.


I have a really weird problem I can't figure out; I haven't seen anything this unexplainable in my 30+ years of programming. Clearly I'm doing something wrong, but I can't figure out what, and I can't even figure out a way around it.

I have a Linux kernel module I've written that implements a block device. It calls out to userspace to supply the data for the block device via ioctl (as in, the userspace program calls the kernel module via an ioctl to get block device requests).

Some technical information on the machines I'm testing on in case it matters:

It runs flawlessly on an Intel Core i7 something-or-other:

> cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
stepping        : 9
microcode       : 0x21
cpu MHz         : 1798.762
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips        : 7139.44
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processors 1-7 are the same

It runs flawlessly on a Raspberry Pi Zero:

> cat /proc/cpuinfo 
processor       : 0
model name      : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS        : 997.08
Features        : half thumb fastmult vfp edsp java tls 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xb76
CPU revision    : 7

Hardware        : BCM2835
Revision        : 920093
Serial          : 000000002d5dfda3

It runs flawlessly on a Raspberry Pi 3:

> cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 4 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 1-3 are the same

Hardware        : BCM2835
Revision        : a02082
Serial          : 00000000e8f06b5e
Model           : Raspberry Pi 3 Model B Rev 1.2

But on my Raspberry Pi 4, it does something really weird that I can't explain, that I'm really stumped about, and that I don't know how to fix.

> cat /proc/cpuinfo 
processor       : 0
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 270.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

Hardware        : BCM2835
Revision        : c03111
Serial          : 10000000b970c9df
Model           : Raspberry Pi 4 Model B Rev 1.1

processor       : 1-3 are the same

So I'm asking for help from somebody who knows more about CPUs, multithreading, cache coherency and memory barriers than I do. Maybe I'm barking up the wrong tree and you can tell me that, if that's the case. I'm pretty sure the program is okay: I've written lots of complicated multithreaded programs in my life, I've checked it over lots of times, and I've had other people review it as well. This is the first multithreaded kernel module I've written, though, so that's where I'm in new territory.

Here's what's going on:

I register a callback with blk_queue_make_request() that handles the read and write requests; I drop all the other ones, returning an error (though I never actually get anything but read/write):

    log_kern_debug("bio operation is not read or write: %d", operation);
    bio->bi_status = BLK_STS_MEDIUM; 
    return BLK_QC_T_NONE;
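For context, the registration itself looks roughly like this on the older (pre-5.9) kernels that blk_queue_make_request() and BLK_QC_T_NONE belong to. This is a simplified sketch with made-up names, not my actual driver code:

    #include <linux/blkdev.h>

    static blk_qc_t mydev_make_request(struct request_queue *q, struct bio *bio);

    static struct request_queue *mydev_queue;

    static int mydev_setup_queue(void)
    {
        /* allocate a queue with no request_fn: we handle bios ourselves and
           bypass the I/O scheduler entirely */
        mydev_queue = blk_alloc_queue(GFP_KERNEL);
        if (!mydev_queue)
            return -ENOMEM;

        blk_queue_make_request(mydev_queue, mydev_make_request);
        blk_queue_logical_block_size(mydev_queue, 512);
        return 0;
    }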

I get the callback from the kernel and iterate through the segments in the bio. For each segment, I make a request to the userspace application (in another thread) to service the read or write request (I'll explain how that works in a minute), and then the original requesting thread goes to sleep. When userspace returns with the data (for a read) or success/failure (for a write), it hands over the data, wakes up the original requesting thread, and the original requesting thread returns the bio to the kernel once all the segments have been serviced:

    bio_endio(bio); // complete the bio, the kernel does the followup callback to the next guy in the chain who wants this bio
    return BLK_QC_T_NONE;
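Stripped of the details of the hand-off to userspace, the per-segment loop is roughly the following. This is a simplified sketch, not my actual code: mydev_service_segment() is a made-up stand-in for the queue-to-userspace-and-sleep logic, and the page-mapping call shown here is also simplified (that detail turns out to matter, as the Solution below explains):

    static blk_qc_t mydev_make_request(struct request_queue *q, struct bio *bio)
    {
        struct bio_vec bvec;
        struct bvec_iter iter;
        sector_t pos = bio->bi_iter.bi_sector;

        bio_for_each_segment(bvec, bio, iter) {
            /* map the segment's page so the kernel side has a usable address */
            void *buf = kmap(bvec.bv_page) + bvec.bv_offset;

            /* hand the segment to the userspace side and sleep until it's serviced */
            if (mydev_service_segment(bio_data_dir(bio), pos, bvec.bv_len, buf))
                bio->bi_status = BLK_STS_IOERR;

            kunmap(bvec.bv_page);
            pos += bvec.bv_len >> 9;   /* advance by the segment size, in 512-byte sectors */
        }

        bio_endio(bio);
        return BLK_QC_T_NONE;
    }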

The way the call to userspace works is this: first, the userspace program makes an ioctl call to the kernel module and the kernel module blocks. That thread stays blocked until a request comes in for the block device. The information about the request (read/write, start position, length, etc.) gets copied to a userspace-provided buffer with copy_to_user(), and then the ioctl call is unblocked and returns. Userspace gets the request from the return of the ioctl, does the read or write, then makes another ioctl call back into the kernel module with the results of the request, which wakes up the original requesting thread so it can return the result in the make_request callback; the userspace ioctl then blocks again, waiting for the next request to come in.
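In code terms, the two ioctls amount to something like the following. Again, this is a heavily simplified sketch: the struct layout, ioctl numbers and names are placeholders rather than my real ones, and the locking around the pending request is omitted:

    #include <linux/completion.h>
    #include <linux/fs.h>
    #include <linux/types.h>
    #include <linux/uaccess.h>
    #include <linux/wait.h>

    /* placeholder ioctl numbers; real code would use _IOR()/_IOW() encodings */
    #define MYDEV_IOCTL_GET_REQUEST 1
    #define MYDEV_IOCTL_PUT_RESULT  2

    struct mydev_req_info {            /* what userspace sees */
        u32 op;                        /* read or write */
        u64 start_pos;
        u64 length;
    };

    struct mydev_request {             /* one in-flight request */
        struct mydev_req_info info;
        void *seg_buf;                 /* mapped bio segment buffer */
        struct completion done;        /* the requesting thread sleeps on this */
    };

    static DECLARE_WAIT_QUEUE_HEAD(mydev_waitq);
    static struct mydev_request *mydev_pending;   /* set by the make_request side */

    static long mydev_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
    {
        struct mydev_request *req;

        switch (cmd) {
        case MYDEV_IOCTL_GET_REQUEST:
            /* block until the make_request callback queues a request */
            if (wait_event_interruptible(mydev_waitq,
                                         (req = READ_ONCE(mydev_pending)) != NULL))
                return -ERESTARTSYS;
            /* describe the request to userspace and unblock the ioctl */
            if (copy_to_user((void __user *)arg, &req->info, sizeof(req->info)))
                return -EFAULT;
            return 0;

        case MYDEV_IOCTL_PUT_RESULT:
            req = mydev_pending;
            /* for a read: copy the data userspace produced into the bio segment buffer */
            if (copy_from_user(req->seg_buf, (void __user *)arg, req->info.length))
                return -EFAULT;
            /* wake the original requesting thread sleeping in the make_request callback */
            complete(&req->done);
            return 0;
        }
        return -ENOTTY;
    }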

So here's the problem. On the Raspberry Pi 4 only, every once in a while, not all the time, the contents of the memory passed between the two threads don't end up looking the same from both threads' points of view. As in, when the data gets passed from the userspace-side thread to the original requesting thread (for a read request in this example), the hash of the data (at THE SAME LOCATION IN MEMORY!) is different. I assume this is a CPU cache coherency type of problem, except that I put in calls to mb(), smp_mb(), READ_ONCE() and WRITE_ONCE(), and I even tried plain old sleeping to give the original calling thread's CPU time to notice. It will reliably fail, but not all the time. I don't have any other Raspberry Pi 4s to test with, but I'm pretty sure the machine is fine because everything else works great. It's something I'm not doing right, but I don't know what.

What follows is a grep of kern.log and an explanation showing what's going on. Every request going to userspace gets a transaction id. The start pos is the location in the block device to read from or write to, the length is the length of the bio segment to read/write, and the crc32 column is the crc32 of the data in the bio segment buffer (for the length listed, always 4k). The address column is the address of the bio segment buffer that the data read from userspace is copied into (and that the crc32 comes from), which is always the same for a given transaction, and the last column is current->tid.

oper    trans id start pos        length           crc32            address  thread
write:  00000a2d 000000000001d000 0000000000001000 0000000010e5cad0          27240

read0:  00000b40 000000000001d000 0000000000001000 000000009b5eeca2 88314387 31415
read1:  00000b40 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read2:  00000b40 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31415
readx:  00000b40 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read3:  00000b40 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31415

read0:  00000c49 000000000001d000 0000000000001000 000000009b5eeca2 88314387 31417
read1:  00000c49 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read2:  00000c49 000000000001d000 0000000000001000 000000009b5eeca2 88314387 31417
readx:  00000c49 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read3:  00000c49 000000000001d000 0000000000001000 000000009b5eeca2 88314387 31417

read0:  00000d4f 000000000001d000 0000000000001000 000000009b5eeca2 88314387 31419
read1:  00000d4f 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read2:  00000d4f 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31419
readx:  00000d4f 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read3:  00000d4f 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31419

read0:  00000e53 000000000001d000 0000000000001000 000000009b5eeca2 1c6fcd65 31422
read1:  00000e53 000000000001d000 0000000000001000 0000000010e5cad0 1c6fcd65 31392
read2:  00000e53 000000000001d000 0000000000001000 0000000010e5cad0 1c6fcd65 31422
readx:  00000e53 000000000001d000 0000000000001000 0000000010e5cad0 1c6fcd65 31392
read3:  00000e53 000000000001d000 0000000000001000 0000000010e5cad0 1c6fcd65 31422

So the steps in the process are as follows. Let's look at the first transaction, id b40, because that one worked correctly; then we'll look at the second one, c49, which didn't. Transaction ids always increase, and the logs above are in chronological order.

1) First the write comes in (trans id a2d). The crc32 of the data written is 10e5cad0. That's the crc32 we expect to see on all reads afterwards until the next write.

2) A read request comes in to the blk_queue_make_request callback handler on thread 31415. At this point I log ("read0") the crc32 of the contents of the bio segment buffer before it is written to, so I can see the before-it-changes value of the bio segment buffer at 88314387.

3) I copy_to_user() the information about the read request and return from the ioctl. Userspace processes it and does an ioctl back into the kernel module with the resulting data, and that data is copy_from_user()ed into the bio segment buffer (at 88314387). It logs ("read1") the crc32 of the bio segment buffer from userspace thread 31392's point of view. It is the expected 10e5cad0.

4) Userspace wakes up the original requesting thread id 31415 now that the data is in the bio segment buffer at 88314387. Thread 31415 calculates the crc32 again and logs ("read2") the value it sees from 31415's point of view. Again, as expected, it is 10e5cad0.

5) For extra sanity checking (the reason for which will become clear in the next transaction), the userspace thread 31392 does the crc of the bio buffer at 88314387 again, comes up with the expected value 10e5cad0, and logs it ("readx"). There is no reason it should change, nobody is updating it, and it still says 10e5cad0.

6) As a final extra sanity check, the original requesting thread 31415 sleeps for 2ms, calculates the crc32 again, and logs it ("read3"). Everything worked; all is well.

Now let's look at the next transaction, id c49. This is a case where the filesystem requested to read the same block twice; I forced this in my test with echo 3 > /proc/sys/vm/drop_caches. I'll start counting steps at 2 so the steps line up with the first example.

2) A read request comes in to the blk_queue_make_request callback handler on thread 31417. At this point I log ("read0") the crc32 of the contents of the bio segment buffer before we write to it. This is the same bio segment buffer from the first transaction b40 (memory location 88314387), but apparently it's been written over since we last set it, and that's fine. It also seems to have been set to the same value as it was at the start of transaction b40: the crc32 value is 9b5eeca2. That's fine; we know the initial crc32 value of that bio segment buffer from thread id 31417's point of view before anybody writes to the buffer.

3) I copy_to_user() the information about the read request and return from the ioctl. Userspace processes it and does an ioctl back into the kernel module with the resulting data, and that data is copy_from_user()ed into the bio segment buffer (at 88314387). It logs ("read1") the crc32 of the bio segment buffer from userspace thread 31392's point of view. It is the expected 10e5cad0. The userspace thread id will always be the same, 31392, because the userspace program making the ioctl calls is single threaded.

4) Userspace wakes up the original requesting thread id 31417 now that the data SHOULD BE in the bio segment buffer at 88314387. Thread 31417 calculates the crc32 again and logs ("read2") the value it sees from its (thread 31417's) point of view. But this time, the value is not the expected 10e5cad0. Instead it is the same value (9b5eeca2) as it was before the request was sent to userspace to update the buffer. It is as if userspace didn't write to the buffer. But it did, because we read it, calculated the crc32 value, and logged it in the userspace-side thread 31392. Same memory location, different thread, different perception of the contents of the bio segment buffer at 88314387. Different thread, presumably different CPU, and therefore different CPU cache. Even if I were screwing up the thread blocking and waking, the logs show the order of events: one thread read the correct value after the other thread misread it.

5) Again the extra sanity checking: the userspace thread 31392 does the crc of the same bio buffer at 88314387 again, gets the same correct value of 10e5cad0, and logs it ("readx"). The logs are chronological, so thread id 31392 sees the correct value after thread id 31417 saw the wrong value.

6) As a final extra sanity check, the original requesting thread 31417 sleeps for 2ms, calculates the crc32 again, and logs it ("read3"), and it still sees the incorrect value 9b5eeca2.

Of the four read transactions I logged above, 1, 3 and 4 worked, and 2 did not. So I figured, okay, it must be a cache coherency problem. But I added mb() and smp_mb() calls after read1 and before read2, and nothing changed.

I am stumped. I've read the Linux kernel memory-barriers documentation

https://www.kernel.org/doc/Documentation/memory-barriers.txt

a number of times, and I figured smp_mb() should just fix everything, but it does not.

I have no idea how to fix this. I can't even think of a lousy workaround. I set the contents of a memory location, and the other thread just doesn't see it. What do I do?

Help? Thanks.


Solution

  • So, miracle of miracles, I came across the answer totally by accident. I'd like to share it in case anybody else comes across this and similarly bangs their head for months.

    In a totally unrelated change for another system I'm working on that uses this block driver, I made a change today, tried it on the Pi 4, and like magic it's all working.

    What was the change? Not at all where I was looking....

    So I register a callback with blk_queue_make_request(), not blk_init_queue(). I don't handle the request queue; I handle the bios in the block requests directly.

    In so doing, you are told: https://www.kernel.org/doc/htmldocs/kernel-api/API-blk-queue-make-request.html

    "The driver that does this must be able to deal appropriately with buffers in “highmemory”. This can be accomplished by either calling __bio_kmap_atomic to get a temporary kernel mapping, or by calling blk_queue_bounce to create a buffer in normal memory. "

    Well, I had been achieving that with calls to kmap_atomic when I wanted to get the buffer. Today I read that there are a limited number of slots for those memory mappings, and that you should only call kmap_atomic if you're in an interrupt context and can't go to sleep, because it pulls from a reserved pile so it doesn't have to allocate on the call and possibly be put to sleep.

    But my kernel module can sleep, so I changed the call to kmap() and... like magic... it's working.
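    In other words, the change boiled down to one call when mapping the bio segment's page. Here's a simplified before/after sketch (bvec stands for the current bio_vec; these aren't my exact variable names):

        /* before: kmap_atomic gives a per-CPU atomic mapping; you must not
           sleep while holding it, and for a highmem page its address only
           means something on the CPU that created it */
        void *buf = kmap_atomic(bvec.bv_page);
        /* ... use buf + bvec.bv_offset ... */
        kunmap_atomic(buf);

        /* after: kmap gives a regular highmem mapping that is valid on every
           CPU, and you're allowed to sleep while holding it */
        void *buf = kmap(bvec.bv_page);
        /* ... use buf + bvec.bv_offset ... */
        kunmap(bvec.bv_page);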

    So I'm thinking the failure cases were where kmap_atomic was failing in a way I wasn't catching or noticing, or possibly something is wrong with kmap_atomic on the Pi 4, or with how I was using it with that kernel, or something. I'll play more and see if I can tease out what's going on, but the trick is that something's wrong with the way I was calling kmap_atomic.

    After a little playing around...

    Feb 25 21:12:46 pi205 kernel: [86149.193899][13905] kernel:    buffer after kmap_atomic ffefd000
    Feb 25 21:12:46 pi205 kernel: [86149.193912][13905] kernel:    buffer after kmap        bfe84000
    

    So the cases where kmap_atomic returns a different value than kmap are exactly the cases where the other thread doesn't see the memory correctly. I read something saying there's a per-CPU cache of these kmap_atomic mappings, which would explain that behaviour, if that's what it is.
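    The comparison that produced those two log lines was presumably just mapping the same bio segment page both ways and printing the addresses; roughly (simplified; %px prints the raw pointer value):

        void *a = kmap_atomic(bvec.bv_page);
        log_kern_debug("   buffer after kmap_atomic %px", a);
        kunmap_atomic(a);

        void *b = kmap(bvec.bv_page);
        log_kern_debug("   buffer after kmap        %px", b);
        kunmap(bvec.bv_page);

    When the two addresses agree, the page already has a permanent kernel mapping; when they differ, the kmap_atomic address is a temporary per-CPU slot, which, if that's what's happening, would be exactly why a different thread on a different CPU doesn't see the data through it.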