U-Boot: different behavior for manual boot vs autoboot

I have set the following:

setenv mtdparts spi0.0:512k@0(uboot)ro,3M@0x200000(Kernel),11M@0x500000(RootFS1),2M@0x200000(Kernel_legacy),256k@0x80000(U-Boot_Config),1280k@0xc0000(NAS_Config),16M@0(all)ro
setenv fbootcmd cp.l 0xf8200000 0x800000 0xc0000\;cp.l 0xf8500000 0xb00000 0x2C0000\;bootm 0x800000

setenv bootargs console=ttyS0,115200 root=/dev/ram initrd=0xb00000,0xB00000 ramdisk=34816 cmdlinepart.mtdparts=${mtdparts} mtdparts=${mtdparts}
setenv nbootcmd dhcp\;tftpboot 0x800000 dl-\${bootfile}\;bootm 0x800000
setenv wlanaddr 00:08:9b:cc:cb:cb
setenv ethaddr  00:08:9b:cc:cb:ca

setenv bootcmd   uart1 0x68\;printenv nbootcmd fbootcmd bootargs\;run nbootcmd\;echo fallback to flash boot\;run run fbootcmd

So bootcmd tries a network boot (nbootcmd) and, if that fails, falls back to a flash boot (fbootcmd).

If I run this "by hand" using boot, bootd or run bootcmd:

I see a DHCP request arrive at my server
I see TFTP traffic via wireshark
The system boots fine.

If however I remove the magic jumper (JP1) and let it autoboot, what I see is:

=dhcp;tftpboot 0x800000 dl-${bootfile};bootm 0x800000
=cp.l 0xf8200000 0x800000 0xc0000;cp.l 0xf8500000 0xb00000 0x2C0000;bootm 0x800000
=console=ttyS0,115200 root=/dev/ram initrd=0xb00000,0xB00000 ramdisk=34816 cmdlinepart.mtdparts=spi0.0:512k@0(uboot)ro,3M@0x200000(Kernel),11M@0x500000(RootFS1),2M@0x200000(Kernel_legacy),256k@0x80000(U-Boot_Config),1280k@0xc0000(No

long delay here, with lots of ARP ...

fallback to flash boot

I don't see any DHCP traffic. The only network traffic, after a few seconds (at the "long delay" point marked above), is ARP who-has traffic involving 192.168.0.1 and 192.168.0.50. (NB: in the U-Boot environment ipaddr=192.168.0.50 and serverip=192.168.0.1; neither exists on my network.)

Having stared at the output for some hours, it looks as if it runs everything in bootcmd but skips the two "run" commands: I see the output of the printenv and the echo, but nothing else. The output from printenv does not look quite right either.

Can I use "run" in bootcmd (is it legal)? I could try simply saying:

setenv bootcmd   uart1 0x68\;${nbootcmd}\;echo fallback to flash boot\;${fbootcmd}

and have it all expanded, I guess, but I'm trying to avoid speculative flashes.

BTW, I see I have "run run", a typo, but I get no error message for it either. (Not fixed yet, again to limit flashes.)

I think that boot/bootd should behave exactly the same way as an autoboot.

Solution

Well, that was really annoying. I'll write it up because I can't be the only one to hit it. Some of it is hardware-specific, so I'll call that out for search: [ QNAP, ts412, ts419, ts219 ]

First, a huge aid in debugging this was discovering the U-Boot "reset" command.

Before, in order to use autoboot, I had to remove JP1, losing my console. Now I can leave JP1 in place and issue "reset", and it behaves exactly as if you'd just powered on (as it happens this includes resetting the peripherals, which was critical in this case).

Now that I could see the console, I could compare a "bad boot" (after a reset/power-on) with a "good boot" (typing boot/bootd).

First, I stripped down my bootcmd:

bootcmd=uart1 0x68;echo net;dhcp;tftpboot 0x800000 dl-${bootfile};bootm 0x800000;echo flash;cp.l 0xf8200000 0x800000 0xc0000;cp.l 0xf8500000 0xb00000 0x2C0000;bootm 0x800000

(I removed all the indirect [run] uses; this was [probably] unnecessary. It tries a net boot and falls back to a flash boot.)
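
For reference, a sketch of how the indirect and expanded forms differ. It assumes the prompt treats ${...} the way the setenv lines in the question rely on: expanded when the line is typed, unless the $ is escaped. I have not flashed either of these exact lines.

```
setenv bootcmd run nbootcmd\;echo fallback to flash boot\;run fbootcmd
setenv bootcmd ${nbootcmd}\;echo fallback to flash boot\;${fbootcmd}
```

The first stores the literal text "run nbootcmd" and looks nbootcmd up each time bootcmd runs. The second copies the current contents of nbootcmd and fbootcmd into bootcmd once, so later changes to those variables are not picked up. Either way the same commands should end up being executed, which fits the logs below: the failure looks the same with the run calls removed.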

----- reset---- (fails) ---
DHCPDISCOVER(enp4s0) 00:08:9b:cc:cb:ca 
CPU : Marvell Feroceon (Rev 1)


USB 0: host mode
PCI 0: PCI Express Root Complex Interface
PEX interface detected Link X1
Net:   egiga0 [PRIME], egiga1
Hit any key to stop autoboot:  0 
Unknown command 'uart1' - try 'help'
net
egiga0 no link
egiga1 no link
BOOTP broadcast 1
BOOTP broadcast 2
BOOTP broadcast 3
BOOTP broadcast 4
BOOTP broadcast 5

Retry count exceeded; starting again
Using egiga0 device
TFTP from server 192.168.0.1; our IP address is 192.168.0.50
Filename 'dl-'.
Load address: 0x800000

Loading: T T T T T T T T T T 
Retry count exceeded; starting again
## Booting image at 00800000 ...
Bad Magic Number
flash
## Booting image at 00800000 ...
   Image Name:   kernel 6.1.0-13-marvell
   Created:      2023-11-27  17:15:24 UTC
   Image Type:   ARM Linux Kernel Image (uncompressed)
   Data Size:    2634850 Bytes =  2.5 MB
   Load Address: 00008000
   Entry Point:  00008000
   Verifying Checksum ... OK
OK
----- reset----

------ boot cmd (works) -------
Marvell>> boot
Unknown command 'uart1' - try 'help'
net
BOOTP broadcast 1
Bootfile Prefix: F_TS-412
*** Unhandled DHCP Option in OFFER/ACK: 28
*** Unhandled DHCP Option in OFFER/ACK: 28
DHCP client bound to address 10.117.1.232
Using egiga0 device
TFTP from server 10.117.0.152; our IP address is 10.117.1.232

---------------------------

The clue was these lines:

egiga0 no link
egiga1 no link
BOOTP broadcast 1
BOOTP broadcast 2

I believe these come from the dhcp command: it had no working network interface (no link), so it failed.

My conjecture is that the networking hardware had been reset and was still initialising. The manufacturer would have used flash boots or (during development) serial console + PXE, and so never tested this use case.

So the "fix" needs to introduce a delay (bootdelay=5 did not help, BTW), which means some command in the boot sequence has to take a while. We already have one that takes quite a long time: the failing "dhcp". So we simply add another dhcp command: the first one fails, then the second one works straight away.

bootcmd=uart1 0x68;echo net;dhcp;dhcp;tftpboot 0x800000 dl-${bootfile};bootm 0x800000;echo flash;cp.l 0xf8200000 0x800000 0xc0000;cp.l 0xf8500000 0xb00000 0x2C0000;bootm 0x800000

Pulled JP1, powered on and yep, it boots from the network.
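
For anyone retyping this at the prompt: the bootcmd line above is in printenv form. A sketch of the equivalent setenv, assuming the same escaping as in the question (\; to keep the separators inside the variable, \$ to defer expansion of bootfile until boot time), followed by saveenv to write the environment to flash:

```
setenv bootcmd uart1 0x68\;echo net\;dhcp\;dhcp\;tftpboot 0x800000 dl-\${bootfile}\;bootm 0x800000\;echo flash\;cp.l 0xf8200000 0x800000 0xc0000\;cp.l 0xf8500000 0xb00000 0x2C0000\;bootm 0x800000
saveenv
```

The logs show "Unknown command 'uart1'" on this build, so that first command could probably be dropped; it is kept here only to match the line above.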