Usermode CPU Data cache invalidation / flush on Linux (Cortex A53)

I would like to find a way to invalidate and flush the data caches (L1/L2) from user mode on a Linux platform with heterogeneous, non-coherent caches (ARM A53 and ARM M7 cores). My problem is on the A53 side, where an SMP Linux is running. The M7 cores run a bare-metal program in which I have already implemented the flush / invalidation.

I know it is possible to write a kernel module for that but before doing this, I would like to know if there is any hidden API to do so? I am running on Linux 5.10.120. If not, what would be the most performant way to perform the flush / invalidation?

I used __builtin___clear_cache; however, I found out this was for instructions only. Also, I would like to have a better granularity on the operation to avoid invalidating lines when flushing if not needed. Finally, the cacheflush function in asm/cachectl.h is not available (the header is not present for my CPU / target).

Disclaimer: This question has been asked many times. Most answers try to be smart by saying "You don't need to do that...". Please refrain from answering like this, as the context in which I am asking this question requires data cache flush / invalidation.

Solution

If you are running 32-bit user-space software, then you are out of luck. You cannot access data cache maintenance ops from user-space.

(Updated) If you are running in AArch64 (64-bit) mode, AND the kernel has enabled access (SCTLR_EL1.UCI and SCTLR_EL2.UCI set to 1), then you can use a few of the by-VA data cache maintenance instructions from user-space (DC CVAU, DC CIVAC, and DC CVAC).
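As a rough illustration of what these instructions look like from C, here is a minimal sketch using GCC/Clang inline assembly. The helper names (dcache_clean_line, dcache_clean_invalidate_line, dcache_sync) are made up for this example, and the non-AArch64 branches are no-ops added only so the sketch compiles on other hosts. It assumes the kernel has enabled EL0 access as described above; if it has not, the DC instructions will fault.

```c
#include <assert.h>
#include <stddef.h>

/* Clean (write back) the cache line containing va to the Point of
 * Coherency. The line stays valid in the cache. */
static inline void dcache_clean_line(const void *va)
{
    assert(va != NULL);
#if defined(__aarch64__)
    __asm__ volatile("dc cvac, %0" : : "r"(va) : "memory");
#else
    (void)va; /* assumption: non-AArch64 host, nothing to do */
#endif
}

/* Clean and then invalidate the cache line containing va. */
static inline void dcache_clean_invalidate_line(const void *va)
{
    assert(va != NULL);
#if defined(__aarch64__)
    __asm__ volatile("dc civac, %0" : : "r"(va) : "memory");
#else
    (void)va; /* assumption: non-AArch64 host, nothing to do */
#endif
}

/* Wait for preceding cache maintenance to complete. */
static inline void dcache_sync(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dsb sy" : : : "memory");
#else
    __asm__ volatile("" : : : "memory"); /* compiler barrier only */
#endif
}
```

Note that the invalidate-only instruction (DC IVAC) is not available at EL0, so from user-space "invalidate" means clean-and-invalidate. If a line in a receive buffer is dirty on the A53 side, DC CIVAC will write it back before discarding it, so the A53 should not hold dirty lines in a buffer the M7 is about to fill.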

"I used __builtin___clear_cache, ... I would like to have a better granularity on the operation to avoid invalidating lines when flushing if not needed."

Better granularity than what? __builtin___clear_cache lets you specify a start and an end address, so what more granularity do you need?

For the "by VA" instructions that can be made accessible from user-space, you have no choice but to iterate and clean/invalidate line by line, for every 64-byte chunk of your buffer ...
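The line-by-line loop could look something like the sketch below. It is only an illustration: cache_maint_range, dcache_line_size and enum cache_op are hypothetical names, not an existing API. Instead of hard-coding 64, it reads the smallest data cache line size from CTR_EL0 (DminLine field), which Linux normally lets user-space read; on a non-AArch64 host it falls back to 64 and performs no maintenance, so the code still compiles there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest D-cache line size in bytes. On AArch64 it is decoded from
 * CTR_EL0.DminLine (bits [19:16], log2 of the line size in 4-byte words).
 * Assumption: on any other host we simply return 64. */
static size_t dcache_line_size(void)
{
#if defined(__aarch64__)
    uint64_t ctr;
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    return (size_t)4 << ((ctr >> 16) & 0xF);
#else
    return 64;
#endif
}

enum cache_op { CACHE_CLEAN, CACHE_CLEAN_INVALIDATE };

/* Apply op to every cache line overlapping [buf, buf + len).
 * Returns the number of lines touched. */
static size_t cache_maint_range(const void *buf, size_t len, enum cache_op op)
{
    size_t line = dcache_line_size();
    assert(line != 0 && (line & (line - 1)) == 0); /* power of two */

    if (len == 0)
        return 0;

    /* Round the start down to a line boundary; the end stays exclusive. */
    uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(line - 1);
    uintptr_t end = (uintptr_t)buf + len;
    size_t touched = 0;

    for (; p < end; p += line, touched++) {
#if defined(__aarch64__)
        if (op == CACHE_CLEAN)
            __asm__ volatile("dc cvac, %0" : : "r"(p) : "memory");
        else
            __asm__ volatile("dc civac, %0" : : "r"(p) : "memory");
#else
        (void)op; /* assumption: no cache maintenance on this host */
#endif
    }

#if defined(__aarch64__)
    /* One barrier after the loop so all operations have completed. */
    __asm__ volatile("dsb sy" : : : "memory");
#endif
    return touched;
}
```

Typical use would be cache_maint_range(tx_buf, tx_len, CACHE_CLEAN) before signalling the M7, and cache_maint_range(rx_buf, rx_len, CACHE_CLEAN_INVALIDATE) before reading what the M7 wrote. Keeping shared buffers aligned to, and sized in multiples of, the line size avoids touching unrelated data that happens to share a line.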

Depending on the amount of data you are transferring, it's often better to set up a non-cached DMA region to avoid the cost of the cache maintenance.