
NetBSD kernel crash due to cache error - What is a good point to start debugging?


Below is the stack trace of a NetBSD 5.1 kernel crash. The basic data collected from the minicore is:

-------------------------------------------------------------
                       VALID MAGIC: 0xdfeedfee
--------------------------PANIC STRING-----------------------  
panic string is :cache error @ EPC 0xffffffff80359a78 

L1D_CACHE_ERROR_LOG 0

L1D_CACHE_INTERRUPT 0 

status 0x8305, cause 0x78

Stack Trace :

sys/arch/evbmips/navasota/md_dump.c:52: 803e092c :

sys/arch/evbmips/rmixl/machdep.c:1247: 8032c468 :

sys/kern/subr_prf.c:313: 802934ec :

sys/bcm/soc/miim.c:1588: 804b6be0 :

sys/bcm/soc/phy/phyreg.c:1049: 806eb6fc :

sys/bcm/soc/phy/wc40.c:9257: 808a7914 :

sys/bcm/soc/phy/wc40.c:3664: 808acbe4 :

sys/bcm/soc/phyctrl.c:1124: 804bc6ac :

sys/bcm/bcm/esw/port.c:10104: 805fbd48 <_bcm_port_link_get+0x298>:

sys/bcm/bcm/esw/bcm_elink.c:1812: 805b803c <_bcm_esw_linkscan_update+0x27bc>:

sys/bcm/bcm/esw/bcm_elink.c:3201: 805ba83c <_bcm_esw_linkscan_thread+0x35c

??:0: 8031ab9c :

This is running on MIPS. I need help with two different things here:

1) I see that MIPS manages its cache in software. How does this architecture work? It would be great if someone could give me a few pointers. While trying to understand it, I found that this may be related to a cache coherence issue (or is it a hardware problem?).

2) What would be a good starting point for debugging this? I want to understand how to decode the status and cause values shown above.


Solution

  • You can download the MIPS architecture reference manuals from Imagination here. You should also get a copy of See MIPS Run for a user-friendly explanation of how the processors work.

    Your Cause register (0x78) has an ExcCode of 30 in bits 6:2, which corresponds to CacheErr. See MIPS Run says that this is caused by an ECC or parity error in a cache, which sounds like a hardware failure.

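    The Status register (0x8305) has ERL (bit 2) set, which the processor sets when it takes a cache error exception, so it is also consistent with an ECC/parity error.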

    Try running the code on a different machine and see if it still fails.
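
To make the decoding of status and cause concrete, here is a minimal sketch in Python that splits the two values from the dump into their fields. It assumes the standard MIPS32/MIPS64 CP0 Status and Cause bit layout; the function and table names (`decode_cause`, `decode_status`, `EXCCODE_NAMES`) are made up for illustration and are not part of NetBSD, and a vendor core may define additional implementation-specific bits that this ignores.

```python
# Minimal sketch: decode MIPS CP0 Cause and Status values from a crash dump.
# Bit positions assume the standard MIPS32/MIPS64 privileged layout.
# All names here are illustrative, not taken from NetBSD sources.

# Partial table of architecturally defined exception codes.
EXCCODE_NAMES = {
    0: "Int", 1: "Mod", 2: "TLBL", 3: "TLBS", 4: "AdEL", 5: "AdES",
    6: "IBE", 7: "DBE", 8: "Sys", 9: "Bp", 10: "RI", 11: "CpU",
    12: "Ov", 13: "Tr", 30: "CacheErr",
}


def decode_cause(cause):
    """Split a Cause register value into its main fields."""
    exccode = (cause >> 2) & 0x1F          # ExcCode: bits 6:2
    return {
        "exccode": exccode,
        "name": EXCCODE_NAMES.get(exccode, "unknown/implementation-specific"),
        "IP": (cause >> 8) & 0xFF,         # pending interrupts: bits 15:8
        "BD": (cause >> 31) & 1,           # exception in branch delay slot
    }


def decode_status(status):
    """Split a Status register value into its main fields."""
    return {
        "IE": status & 1,                  # global interrupt enable: bit 0
        "EXL": (status >> 1) & 1,          # exception level: bit 1
        "ERL": (status >> 2) & 1,          # error level: bit 2
        "KSU": (status >> 3) & 3,          # operating mode: bits 4:3
        "IM": (status >> 8) & 0xFF,        # interrupt mask: bits 15:8
    }


if __name__ == "__main__":
    print(decode_cause(0x78))
    print(decode_status(0x8305))
```

For the values in the dump, this reports ExcCode 30 (CacheErr) with no pending interrupt bits in Cause, and a Status with IE and ERL set, EXL clear, and an interrupt mask of 0x83.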