
Sysmgr volatile memory bug


As you may know, Nexus switches running versions earlier than System version 5.0(2)N2(1) have a bug, reported as SYSMGR-2-VOLATILE_DB_FULL, that causes the switch to crash and reboot once /dev/shm reaches 100% usage, unless the switch is upgraded to a later version.

To fill the directory, you can run commands with long output, such as "show run" (the output needs to be over 190 lines), and then watch the usage increase by running:

show system internal flash
show system internal dir /dev/shm | i csm_acfg | count
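If you save the output of "show system internal flash" to a file, a small script can pull out the /dev/shm usage so it is easier to track over time. This is only a minimal sketch: the column layout (mount point first, Use% in the fifth column) and the sample text are assumptions for illustration and were not captured from a real switch, and the function name shm_usage_percent is made up.

```python
def shm_usage_percent(flash_output, mount="/dev/shm"):
    """Return the Use% value for `mount`, or None if no matching row is found."""
    for line in flash_output.splitlines():
        fields = line.split()
        # Assumed row layout: Mount-on, 1K-blocks, Used, Available, Use%, Filesystem
        if len(fields) >= 5 and fields[0] == mount:
            return int(fields[4].rstrip("%"))
    return None


# Hypothetical sample text, NOT real switch output.
SAMPLE = """\
Mount-on                  1K-blocks      Used   Available   Use%  Filesystem
/                            409600     61104      348496     15  /dev/root
/dev/shm                     409600    389120       20480     95  none
"""

print(shm_usage_percent(SAMPLE))
```

Comparing the value before and after a long "show run" would show whether /dev/shm is actually growing on a given switch.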

I was wondering whether there is a similar bug on 4500 switches?

Catalyst 4500 L3 Switch Software (cat4500es8-UNIVERSALK9-M), Version 03.11.00.E RELEASE SOFTWARE (fc3)

So what exactly happened...

I have a script that runs from time to time, pulls over 190 lines of output from all of our switches, and performs some actions remotely. Recently, a few minutes after the script ran, we had a massive outage because our core switch had a power outage (at least, that is what I was able to see from the logs). The thing is, there are two 4500 chassis configured with SSO redundancy, so the failover should have been instantaneous. However, everything was down for about 8 minutes before the standby switch became active.

Can anyone please advise whether there is a similar bug on 4500 switches?

Thank you.


Solution

  • After analysing the crash info I was able to find some things that led to the crash; however, I can't tell with 100% certainty exactly what caused it.

    There are two errors involved, VFETQINTERRUPT and VFETQTOOMANYPARITYERRORS. Basically, VFETQINTERRUPT counts errors that accrue quickly, and VFETQTOOMANYPARITYERRORS crashes and reboots the switch if more than 100 errors occur in a short period of time, which could indicate a hardware fault.

    This is pretty much what happened in our environment: something caused 100+ errors, and the switch crashed and rebooted.

    There is a command to stop the switch from crashing and rebooting; however, I'm not sure it should be used, because if there is a hardware issue it is better to fail over to the other supervisor.

    platform fw-asic dbl hash memory parity-error reload never
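To judge whether the parity errors keep recurring (which would point more towards hardware), you could count how often the two message names show up in saved log or crash-info text. This is a rough sketch only: it matches the names as plain substrings, the sample lines are placeholders rather than the real log format, and the function name count_mnemonics is made up.

```python
from collections import Counter

# The two message names mentioned above.
MNEMONICS = ("VFETQINTERRUPT", "VFETQTOOMANYPARITYERRORS")


def count_mnemonics(log_text, mnemonics=MNEMONICS):
    """Count the lines of log_text that mention each message name."""
    counts = Counter({name: 0 for name in mnemonics})
    for line in log_text.splitlines():
        for name in mnemonics:
            if name in line:
                counts[name] += 1
    return counts


# Placeholder lines, NOT the real log format.
SAMPLE_LOG = """\
<placeholder> VFETQINTERRUPT <placeholder>
<placeholder> VFETQINTERRUPT <placeholder>
<placeholder> VFETQTOOMANYPARITYERRORS <placeholder>
"""

print(dict(count_mnemonics(SAMPLE_LOG)))
```

If the counts keep climbing across several log captures, that would support leaving the default reload behaviour in place so the switch fails over to the other supervisor.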