I am debugging code for a cryptographic implementation on an Infineon TriCore TC275 (the disassembly below is TriCore assembly). The relevant part of the memory map given to the linker:
PMI_PSPR (wx!p): org = 0xC0000000, len = 24K /*Scratch-Pad RAM (PSPR)*/
DMI_DSPR (w!xp): org = 0xD0000000, len = 112K /*Local Data RAM (DSPR)*/
Right after entering the mac function, the stack pointer a[10] always points into a reserved memory area.
###### typedefs ######
typedef uint16_t limb_t;
typedef limb_t gf_t[DIGITS]; //DIGITS=312
typedef int32_t dslimb_t;
################################
/**Multiply and accumulate c += a*b*/
void mac(gf_t c, const gf_t a, const gf_t b)
1: 0xC0000812: D9 AA 40 9F LEA a10,[a10]-0x9C0 //Load eff. addr.
/*Reference non-Karatsuba MAC */
dslimb_t accum[2*DIGITS] = {0};
2: 0xC0000816: 40 A2 MOV.AA a2,a10
3: 0xC0000818: D2 02 MOV e2,0x0 //move 0x0 to d2 and d3
4: 0xC000081A: C5 03 37 40 LEA a3,0x137 // loop counter 311: LOOP runs 312 times = 0.5*length of accum
5: 0xC000081E: 89 22 48 01 ST.D [a2+]0x8,e2 //<= fails here
6: 0xC0000822: FC 3E LOOP a3,0xC000081E
7: 0xC0000824: 40 AF MOV.AA a15,a10
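To see why a3 is loaded with 0x137, here is a rough C rendering of the zeroing loop (steps 2 to 6). It assumes the TriCore LOOP instruction branches back while a3 is nonzero and decrements a3 after the test, so the body runs a3+1 = 312 times, each ST.D storing 8 bytes. `zero_like_asm` is a hypothetical name for illustration, not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

#define DIGITS 312
typedef int32_t dslimb_t;

/* 312 iterations of an 8-byte store (ST.D) cover exactly the 0x9C0-byte frame. */
static_assert((0x137 + 1) * 8 == 2 * DIGITS * sizeof(dslimb_t),
              "LOOP count times 8 bytes equals sizeof accum");

/* Hypothetical C equivalent of:
 *   MOV.AA a2,a10 ; MOV e2,0x0 ; LEA a3,0x137
 *   loop: ST.D [a2+]0x8,e2 ; LOOP a3,loop
 * Returns the number of 8-byte stores performed. */
static unsigned zero_like_asm(dslimb_t *accum)
{
    dslimb_t *a2 = accum;            /* a2 = a10 (base of accum) */
    unsigned stores = 0;
    for (uint32_t a3 = 0x137;; a3--) {
        a2[0] = 0;                   /* d2 = 0 */
        a2[1] = 0;                   /* d3 = 0, so e2 = d3:d2 is a 64-bit zero */
        a2 += 2;                     /* post-increment by 8 bytes */
        stores++;
        if (a3 == 0)                 /* LOOP falls through once a3 reaches 0 */
            break;
    }
    return stores;
}
```

On the target, the very first of these stores goes to 0xCFFFFC40, which is where it faults.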
###contents of relevant registers###
step  reg    before      after
1:    a[10]  D000 0600   CFFF FC40   (not defined in memory map?)
2:    a[2]   D000 0A06   CFFF FC40
3:    d[2]   0000 0002   0000 0000
3:    d[3]   0000 0000   0000 0000   (would have been set to zero too)
4:    a[3]   0000 0186   0000 0137   (# of iterations in loop)
5:    a[2]   CFFF FC40   -           (store failed here)
value @ CFFF FC40: ???? ???? ???? ????   (write is not allowed, I guess)
0x9C0 = 2496 (decimal), and the array accum has 624 elements, each an int32_t. So 624*4 = 2496 bytes get allocated on the stack, right?
But as far as I understand the memory map given to the linker, no writes are allowed at this address... so why does the generated assembly code try to write there in step 5?
Does anybody know what I might be doing wrong here? I also tried to use calloc to allocate the memory on the heap (instead of on the stack, as the code above does, right?), but the program still crashed.
I also copied the line dslimb_t accum[2*DIGITS] = {0}; to the start of the program, where it executed without an error.
Thank you very much for any help!
EDIT
mac is called like this (uniform samples some uniform random numbers):
gf_t sk_expanded[DIM], b, c;
for (unsigned i = 0; i < DIM; i++) {
    noise(sk_expanded[i], ctx, i);
}
for (unsigned i = 0; i < DIM; i++) {
    noise(c, ctx, i + DIM);              // noisy elements in c after call
    for (unsigned j = 0; j < DIM; j++) {
        uniform(b, pk, i + DIM * j);     // uniform random numbers in b after call
        mac(c, b, sk_expanded[j]);       // fails here on first call
    }
    contract(&pk[MATRIX_SEED_BYTES + i * GF_BYTES], c);
}
This code runs on my host machine, but on my TriCore microcontroller it fails in the first mac() call.
The "stack pointer" a10 is 0xD0000600 before the call, the stack grows downward on this platform, and the memory chip assigned to this area starts at 0xD0000000. So you have only 0x600 (1536) bytes of stack available for locals and for other function calls (and their locals!).
Does anybody know what I might be doing wrong here?
But you are trying to allocate 0x9C0 bytes (plus a few more for b and c, unless those end up in registers and the optimizer is smart enough not to allocate stack space for them). 0xD0000600 - 0x9C0 = 0xCFFFFC40, which is below the start of the DSPR, so you leave the designed memory area and the first write instruction crashes. If you requested many more bytes, the resulting address could even land inside the scratch-pad RAM (close to 0xC0000000); then the code would crash while clearing the array, once it left the scratch-pad area.
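The arithmetic can be checked directly. This sketch uses the DSPR bounds from the memory map in the question; `sp_after_frame` and `frame_fits` are hypothetical helpers, and they assume a downward-growing stack whose current pointer lies inside the DSPR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* DMI_DSPR from the memory map: org = 0xD0000000, len = 112K. */
#define DSPR_START 0xD0000000u
#define DSPR_LEN   (112u * 1024u)

/* New stack pointer after reserving `frame` bytes (stack grows downward). */
static uint32_t sp_after_frame(uint32_t sp, uint32_t frame)
{
    return sp - frame;
}

/* Hypothetical check: does a frame of `frame` bytes stay inside the DSPR? */
static bool frame_fits(uint32_t sp, uint32_t frame)
{
    if (sp < DSPR_START || sp > DSPR_START + DSPR_LEN)
        return false;                /* SP is not in the DSPR at all */
    return frame <= sp - DSPR_START; /* bytes available below SP */
}
```

With the values from the question, only 0x600 bytes are available below 0xD0000600, so the 0x9C0-byte frame of mac does not fit.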
But the generated assembly code tries to write there in line 5?
The generated code doesn't check whether the memory is available. In this respect C is an "unsafe" programming language: it is the responsibility of the programmer (plus the maintainer/operator) to build the code and run it in an environment where the stack has enough space. Alternatively, if the code is so dynamic that its stack usage can't be evaluated during development, add checks so that it handles full-stack situations gracefully.
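Plain C cannot portably detect a stack overflow, but it can make memory requirements explicit. One option, sketched here under assumptions, is to let the caller own the accumulator and pass its length, so the function refuses to run instead of writing outside its buffer. `prepare_accum` and the `MAC_*` codes are hypothetical names; the actual multiply-accumulate step of mac is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGITS 312
#define ACCUM_LEN ((size_t)2 * DIGITS)
typedef int32_t dslimb_t;

/* Hypothetical status codes. */
enum { MAC_OK = 0, MAC_ENOMEM = -1 };

/* Validate and clear caller-provided scratch memory for the accumulator.
 * Returns MAC_ENOMEM instead of touching memory the caller did not provide. */
static int prepare_accum(dslimb_t *scratch, size_t scratch_len)
{
    if (scratch == NULL || scratch_len < ACCUM_LEN)
        return MAC_ENOMEM;
    memset(scratch, 0, ACCUM_LEN * sizeof *scratch);  /* replaces `= {0}` */
    return MAC_OK;
}
```

The design choice here is that the buffer's location (stack, static, heap) becomes the caller's decision, which is visible and checkable at the call site.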
I also tried to use calloc to allocate memory on the heap (instead of the stack like the code above does right?) but the program still crashed.
That seems like a different problem, or your heap is full as well. From your comment "heap should be 4k", that is a very small heap: maybe you have already exhausted it with other dynamic allocations, or fragmentation prevents your allocator from returning a contiguous ~2.5 KB block for your array. Heap allocators usually return NULL when their pool is exhausted, but maybe your platform is so constrained that the allocator omits such safety code to keep it smaller.
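If you go the heap route, always check the result of calloc before using the pointer; calloc also zeroes the block, so it replaces the `= {0}` initializer. A minimal sketch, with `alloc_accum` and `use_accum` as hypothetical helpers:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DIGITS 312
typedef int32_t dslimb_t;

/* Allocate a zeroed accumulator of 2*DIGITS int32_t (2496 bytes) on the heap.
 * Returns NULL if the allocator cannot provide a contiguous block. */
static dslimb_t *alloc_accum(void)
{
    return calloc(2 * DIGITS, sizeof(dslimb_t));
}

/* Example caller: reports failure instead of dereferencing a NULL pointer. */
static int use_accum(void)
{
    dslimb_t *accum = alloc_accum();
    if (accum == NULL)
        return -1;                   /* out of heap: handle gracefully */
    /* ... multiply-accumulate into accum here ... */
    free(accum);
    return 0;
}
```

This only helps if the platform's allocator honors the NULL-on-failure contract; on a heap of about 4 KB, a 2496-byte request can fail easily.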
I also copied the line dslimb_t accum[2*DIGITS] = {0} to the start of the program where it was executed without an error.
Then it's a global variable, which gets placed into a .data-like segment (typically .bss for zero-initialized data) located in a sufficiently big memory area.
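You can get the same effect without moving the line out of mac by giving accum static storage duration. A sketch, assuming mac is never called re-entrantly (from an interrupt or a second task), because all calls share one buffer; `get_cleared_accum` is a hypothetical helper standing in for the start of mac. Note that `= {0}` on a static array only initializes it once, so it must be cleared explicitly on every call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIGITS 312
typedef int32_t dslimb_t;

/* accum lives in .bss instead of the 0x600-byte stack.
 * Not reentrant: every call returns the same buffer. */
static dslimb_t *get_cleared_accum(void)
{
    static dslimb_t accum[2 * DIGITS];  /* zeroed once at startup */
    memset(accum, 0, sizeof accum);     /* clear again for each mac() call */
    return accum;
}
```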
And yes, 624 32-bit integers need at least 2496 (624*4) bytes of memory. In C you usually pay zero price for abstraction, so any 2496-byte block of memory, aligned as your platform requires, is enough. In other languages such as Java the total cost of such an array is considerably higher, because of GC housekeeping and the array management data, so you could count on roughly 3000-3500 bytes on such a platform.
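These sizes can be pinned down at compile time with C11 `static_assert`, so a change to DIGITS or to the limb types shows up in the build. The 0x600 figure is the stack space derived from the register dump in the question. The last assertion holds, which is exactly the problem; in a real project you would rather assert that the frame fits your configured stack.

```c
#include <assert.h>
#include <stdint.h>

#define DIGITS 312
typedef uint16_t limb_t;
typedef limb_t gf_t[DIGITS];
typedef int32_t dslimb_t;

/* accum in mac(): 624 elements of 4 bytes each. */
static_assert(sizeof(dslimb_t[2 * DIGITS]) == 2496, "accum is 624 * 4 bytes");
static_assert(sizeof(dslimb_t[2 * DIGITS]) == 0x9C0, "matches LEA a10,[a10]-0x9C0");

/* One gf_t (e.g. b or c in the caller) is 312 * 2 bytes. */
static_assert(sizeof(gf_t) == 624, "gf_t is 312 * 2 bytes");

/* accum alone is larger than the 0x600 bytes left below a10 = 0xD0000600. */
static_assert(sizeof(dslimb_t[2 * DIGITS]) > 0x600, "accum exceeds the available stack");
```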
When developing on such a constrained system (asking for ~2.5 KB of stack for locals sounds completely negligible in desktop/web programming, but on small embedded systems or old 8/16-bit computers it can be a serious amount of memory), it may help to design the code and algorithm in a "data-driven" way: plan your memory usage completely, including where the code resides (and how big it can be) and where the local/global variables are, and know the maximum stack needed to run through all states of the code.
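One common way to measure the maximum stack actually used is stack painting (a "high-water mark"): fill the stack area with a known pattern at startup, run the workload, then count how much of the pattern was overwritten. The sketch below works on an ordinary buffer so it can be tried on a host; on the target, the region would be the stack area defined by your linker script, whose symbol names depend on the toolchain and are not shown here. If the toolchain is GCC-based, `-fstack-usage` additionally reports static per-function frame sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_PAINT 0xA5u  /* arbitrary fill pattern */

/* Fill the (not yet used) stack region with the pattern, e.g. at startup. */
static void stack_paint(uint8_t *region, size_t len)
{
    memset(region, STACK_PAINT, len);
}

/* The stack grows downward, so the deepest usage overwrites the lowest
 * addresses last. Scan upward from the bottom for the first modified byte.
 * Caveat: a byte that happens to be written with 0xA5 is counted as unused. */
static size_t stack_high_water(const uint8_t *region, size_t len)
{
    size_t untouched = 0;
    while (untouched < len && region[untouched] == STACK_PAINT)
        untouched++;
    return len - untouched;  /* bytes used at some point */
}
```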
You can also check why the stack is so small in the first place: the local data RAM (DSPR) is 112 KB, so you probably have enough space there, and there may be a build option to resize the stack (or the linker script can be adjusted).
Actually, you should review your whole memory-consumption design: what data you really need to have in memory, where it lives, which of it is temporary and what its life cycle is (at least as a rough estimate in kilobytes). Check that against the physical memory available on the chip; that tells you how carelessly you can write the code, or whether you are already out of memory for this task before implementation even starts. You can start with the linker map file, to see how much code is produced and how big the fixed variables in the .data/.bss/.rodata/etc. sections are, then check all local variables and heap allocations.
Then maybe allocate the required memory in a few structs. Do you even need any dynamic allocation? Can't you simply design the whole .data segment in the code as a few global struct variables, grouping the data by the abstraction it belongs to, and use those globals in the rest of the code without any dynamic allocation at all?
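A sketch of that idea for the code in the question: the caller's buffers and mac's scratch accumulator grouped into one statically allocated workspace, whose size then appears in the linker map file. `struct keygen_workspace`, `g_keygen_ws` and `keygen_ws_clear` are hypothetical names, and DIM = 2 is only a placeholder because the real DIM is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIGITS 312
#define DIM 2            /* hypothetical placeholder, the real DIM is not shown */
typedef uint16_t limb_t;
typedef limb_t gf_t[DIGITS];
typedef int32_t dslimb_t;

/* All working buffers of the key-generation step in one place. */
struct keygen_workspace {
    gf_t sk_expanded[DIM];
    gf_t b;
    gf_t c;
    dslimb_t accum[2 * DIGITS];   /* scratch for mac() instead of a stack local */
};

/* Static storage (.bss): its size is fixed and visible at link time. */
static struct keygen_workspace g_keygen_ws;

/* Wipe the workspace, e.g. before reuse or to clear secret data afterwards. */
static void keygen_ws_clear(void)
{
    memset(&g_keygen_ws, 0, sizeof g_keygen_ws);
}
```

mac would then receive `g_keygen_ws.accum` (or a pointer to the workspace) instead of declaring its own 2496-byte local array.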
Also, if you are writing some kind of library/support function, make sure it doesn't exhaust all resources of the platform; otherwise it's unclear how anyone could use your functionality alongside their real task. :)