Search code examples
linuxarmfpgasigbus

SIGBUS on ARM Cortex when accessing external device


I have a Zynq UltraScale MPSoC with a Quad core ARM Cortex on it running Linux. Occasionally, there is an event that is generating a SIGBUS error. I have included a snippet of the debug analysis below. I have been assured that the values for dst and src are in legitimate regions. The actual access itself is a copy routine from an FPGA memory resource to an internal ARM memory location.

I have read in another post that a cause of SIGBUS in can be I/O failure. Can anyone expand on what "I/O failure" is relative to an ARM? I envision, something akin to a failed bus acknowledge.

Relative to an ARM Cortex, is there an equivalent to a machine check register that might provide further insight into the cause of a SIGBUS?

#0 ecfm_copy_table_entry_backward (dst=dst@entry=0xee189830, src=<optimized out>, num_words=num_words@entry=72) at src/software/saos-sds/ecfm_driver/ecfm_driver.c:478
#1 0xf658347c in ecfm_copy_table_entry_backward (num_words=72, src=<optimized out>, dst=0xee189830) at src/software/saos-sds/ecfm_driver/ecfm_driver.c:1186
#2 ecfm_get_rx_stats (session_id=session_id@entry=2637, stat=stat@entry=0xee189830) at src/software/saos-sds/ecfm_driver/ecfm_driver.c:1185
#3 0x011c463c in eCfmApiGetRxProcessingStats (sessionId=<optimized out>, stat=0xee1898e0) at src/software/saos-sds/leos/platform/common/src/eCfmApi.c:1836
#4 0x011d6aac in halFPGAGetStats (pPlatformData=pPlatformData@entry=0xe7f98abc, lossStats=0xee1899a8, lossStats@entry=0xee1899a0)
at src/software/saos-sds/leos/platform/common/src/halEcfmFpgaApi.c:2214
#5 0x00a71870 in cfmAgentReadHwStats (data=0xe7f98a70, role=<optimized out>, testType=<optimized out>) at src/software/saos-sds/leos/common/src/genericSwitch/cfm/src/cfmApal.c:1760
#6 0x009fd39c in cfmTestSessionSmiSmEvent (pSession=0xe7f98a70, event=event@entry=CfmTestSmiEvent_DeltaTComplete)
at src/software/saos-sds/leos/common/src/genericSwitch/cfm/src/cfm.c:26242
#7 0x00a75f04 in cfmApalOamFpgaSessionStatusIntHdlr (context=<optimized out>, pMsg=<optimized out>) at src/software/saos-sds/leos/common/src/genericSwitch/cfm/src/cfmApal.c:2461
#8 cfmApalOamFpgaHalY1731IntHdlr (context=<optimized out>, pMsg=0x37b7768 <__func__.44940>) at src/software/saos-sds/leos/common/src/genericSwitch/cfm/src/cfmApal.c:2548
#9 0x00a7cf04 in oamMsgDispatchMsgList (msgList=msgList@entry=0x68dfae8, pMsgContext=pMsgContext@entry=0xee189bc8)
at src/software/saos-sds/leos/common/src/genericSwitch/cfm/src/oamMsg.c:92
#10 0x00a76e50 in cfmHalDispatchMsgList (cpe=<optimized out>, msgList=msgList@entry=0x68dfae8) at src/software/saos-sds/leos/common/src/genericSwitch/cfm/src/cfmApal.c:594
#11 0x00a8f2cc in CfmAgentMsgHdlr (sig=<optimized out>) at src/software/saos-sds/leos/common/src/genericSwitch/cfm/src/cfmAgent.c:1335
#12 0x00a9045c in cfmAgentTmoHdlr (cycle=<optimized out>, extraProcTimeMs=0) at src/software/saos-sds/leos/common/src/genericSwitch/cfm/src/cfmAgent.c:1383
#13 0x00a90598 in cfmAgentMain (arg=<optimized out>) at src/software/saos-sds/leos/common/src/genericSwitch/cfm/src/cfmAgent.c:1441
#14 0x0112ea8c in thread_prologue (arg=<optimized out>) at src/software/saos-sds/leos/os/linux/src/ose_shim.c:1273
#15 0xf704af8c in start_thread (arg=0xee18a3e0) at pthread_create.c:335
#16 0xf646b0a0 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89
from /localdata/perforce/ankgoyal/oneos/branches/saos-sds/dev/main/build/saos-sds/fs/eredan_tarfs/debug/eredan/armv7a/lib/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) ore was generated by `/mnt/apps/bin/leos -s'.
(gdb) Program terminated with signal SIGBUS, Bus error.

Solution

  • SIGBUS is a software signal, generated by the Linux kernel, so you need to understand why the kernel is generating a SIGBUS signal. This may or may not be due to a hardware exception.

    Make sure that the data is correctly align for its type and for what you're doing with it. One of the reasons for a SIGBUS is invalid alignment. Try reproducing the fault with unoptimized code (e.g. for GCC or Clang, without passing a -O option).

    If you've verified that alignment isn't the problem, check in which range the access is. If you're getting a SIGBUS due to an access to a device bus, you'll need to figure out how this memory is mapped into your process, and what the device exposes at that address.

    Do check the kernel logs. They may contain debugging information from the generation of the SIGBUS.

    If the signal is due to a hardware fault, the reason for the exception is indeed conveyed in a register, but only the kernel gets to read the value of this register. The relevant registers are DFSR and DFAR for a failed data fetch or store, and IFSR and IFAR for an instruction fault. However you can only use this information if you manage to find what's going on inside the kernel.