Tags: assembly, x86, nasm, osdev

2nd Stage Bootloader stuck in bootloop


I'm trying to write a 2nd-stage bootloader that switches to protected mode and loads a kernel (without using a file system), but it gets stuck in a boot loop after entering the 2nd stage. My code is written in assembly and assembled with NASM.

I have tried changing the sector number used for loading the kernel in my 1st stage, but that just causes the int 0x13 read to fail and breaks the jmp to the 2nd stage. The error seems to be in the 2nd stage at the "or al, 1", but if I remove that line, protected mode is never enabled. I assume it has something to do with how I am building the image, because if I put the 1st and 2nd stage into one single assembly file, it loads the kernel with no problems. Is this just a problem with the sector I'm trying to load the kernel from? Thanks in advance! (Please note, I have been doing OS development for only a couple of months, and this is my first Stack Overflow post. If I am doing something wrong, please let me know.)

Stg1.asm

; For context, Boot_Disk is 0, and Kernel_Location is 0x2000
[ORG 0x7C00]
mov [BOOT_DISK], dl

xor ax, ax                          
mov es, ax
mov ds, ax
mov bp, 0x8000
mov sp, bp

mov bx, KERNEL_LOCATION
mov dh, 2 ;number of sectors to read

mov ah, 0x02
mov al, dh 
mov ch, 0x00 ;cylinder
mov dh, 0x00 ;head
mov cl, 0x02 ;sector
mov dl, [BOOT_DISK]
int 0x13

jmp 0x0000 ;2nd stg

TIMES 510 - ($ - $$) db 0
dw 0xAA55

Stg2.asm

;Once again, Kernel_Location is 0x2000
[ORG 0x0]
cli
lgdt [gdt_descriptor]
mov eax, cr0
or al, 1
mov cr0, eax
jmp CODE_SEG:start_protected_mode


gdt_start:
gdt_null:
    dd 0x0
    dd 0x0

gdt_code:
    dw 0xffff
    dw 0x0
    db 0x0
    db 10011010b
    db 11001111b
    db 0x0

gdt_data:
    dw 0xffff
    dw 0x0
    db 0x0
    db 10010010b
    db 11001111b
    db 0x0
gdt_end:

gdt_descriptor:
    dw gdt_end - gdt_start - 1
    dd gdt_start

CODE_SEG equ gdt_code - gdt_start
DATA_SEG equ gdt_data - gdt_start


[bits 32]
start_protected_mode:
    mov ax, DATA_SEG
    mov ds, ax
    mov ss, ax
    mov es, ax
    mov fs, ax
    mov gs, ax
    
    mov ebp, 0x90000    ; 32 bit sbp
    mov esp, ebp

    jmp KERNEL_LOCATION

TIMES 2048-($-$$) db 0

run.sh

cat "boot1.bin" "boot2.bin" "full_kernel.bin" "zeros.bin"  > "OS.bin"

Solution

  • Stg1.asm

    1. You didn't set SS to anything (and only set SP), so your stack (at SS:SP) is at an unknowable address and could trash anything or be trashed by anything. Ideally you'd do a cli, mov ss, ..., mov sp, ..., sti sequence to fix this (the cli/sti also avoids a bug in early 8086/8088 chips, where an interrupt could arrive between the mov ss and the mov sp).

    2. Your first mov [BOOT_DISK], dl is before you set DS, so it could write anywhere; and the later mov dl, [BOOT_DISK] (with a now-known and potentially different DS) can be reading an uninitialized value from a completely different address. You can fix it by moving that mov [BOOT_DISK], dl until after the segments are set up.

    3. You don't check whether int 0x13 returns an error (carry flag set, status code in AH), so you can't assume that anything was loaded into memory. Note that real floppy disks were notoriously unreliable, and in that case you'd want a "3 retries" loop before giving the user a nice descriptive error message that they can use to try to fix the problem.

    4. You have no idea what CS is. You only know that the (CS * 16 + IP) & 0xFFFFF calculation resulted in the address 0x00007C00, and there are a few thousand different values of CS that can produce it. So the jmp 0x0000 is jumping to a known offset in an unknown segment, which is equivalent to jumping to an unknown address. You could fix this by using a far jump like jmp 0x0000:0x2000, but I'm unsure of the correct address (see later).

    Note: I don't know what KERNEL_LOCATION is so I can't check to see if the "Kernel_Location is 0x2000" comment/s are wrong.
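    The four points above could be combined into something like the following 1st-stage sketch. It is only an illustration under stated assumptions, not the asker's actual layout: SECOND_STAGE_LOCATION (0x1800), KERNEL_SECTORS, MAX_RETRIES, the stack top at 0x7C00 and the error message are all made up for the example. STAGE2_SECTORS is 4 only because the posted Stg2.asm pads itself to 2048 bytes.

```nasm
; 1st stage sketch. All constants below are assumptions for illustration.
bits 16
org 0x7C00

SECOND_STAGE_LOCATION equ 0x1800    ; assumed: 2048-byte stage 2 ends at 0x2000
STAGE2_SECTORS        equ 4         ; 2048 / 512, from the TIMES 2048 padding
KERNEL_SECTORS        equ 8         ; hypothetical kernel size in sectors
MAX_RETRIES           equ 3

start:
    jmp 0x0000:.set_cs              ; far jump so CS is known to be 0
.set_cs:
    xor ax, ax
    mov ds, ax                      ; set data segments before any memory access
    mov es, ax
    cli                             ; no interrupts while SS:SP is inconsistent
    mov ss, ax
    mov sp, 0x7C00                  ; assumed: stack grows down from the boot sector
    sti
    cld                             ; lodsb below must count upwards
    mov [BOOT_DISK], dl             ; safe now: DS is known

    mov byte [retries], MAX_RETRIES
.read:
    mov ah, 0x02                    ; BIOS read sectors (CHS)
    mov al, STAGE2_SECTORS + KERNEL_SECTORS ; must stay within one track
    mov ch, 0x00                    ; cylinder
    mov cl, 0x02                    ; first sector after the boot sector
    mov dh, 0x00                    ; head
    mov dl, [BOOT_DISK]
    mov bx, SECOND_STAGE_LOCATION   ; ES:BX = 0x0000:0x1800
    int 0x13
    jnc .loaded                     ; carry clear means the read succeeded

    xor ah, ah                      ; BIOS reset disk system, then retry
    mov dl, [BOOT_DISK]
    int 0x13
    dec byte [retries]
    jnz .read

    mov si, disk_error_msg          ; out of retries: report and stop
.print:
    lodsb
    test al, al
    jz .halt
    mov ah, 0x0E                    ; BIOS teletype output
    mov bx, 0x0007                  ; page 0, light grey
    int 0x10
    jmp .print
.halt:
    cli
    hlt
    jmp .halt

.loaded:
    jmp 0x0000:SECOND_STAGE_LOCATION ; far jump: known segment and offset

BOOT_DISK:      db 0
retries:        db 0
disk_error_msg: db "Disk read failed", 0

times 510 - ($ - $$) db 0
dw 0xAA55
```

    With the run.sh layout from the question (boot1.bin, then boot2.bin, then the kernel), this single read would place the 2nd stage at 0x1800 and the kernel directly after it at 0x2000, provided the assumed sizes are right.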

  • Stg2.asm

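    1. [ORG 0x0] is wrong for the protected mode code. The new code segment that is loaded by jmp CODE_SEG:start_protected_mode has a segment base of zero, so the offset in that jump (and the dd gdt_start in gdt_descriptor) has to be the real physical address of the label. With [ORG 0x0] the assembler calculates those addresses as if the 2nd stage were loaded at 0x00000000, so the jump lands in low memory instead of in your code. Actually loading the 2nd stage at "offset zero in the protected mode segment that starts at zero" is not an option either: address 0x00000000 is where the real-mode interrupt vector table and the BIOS's data are, and it's a bad idea to trash the BIOS data while you're using the BIOS (to load the 2nd stage and kernel). ORG needs to match the address the 2nd stage is really loaded at.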

    2. I suspect you're loading the 2nd stage at KERNEL_LOCATION. I'd recommend having a SECOND_STAGE_LOCATION, even if it must be contiguous (e.g. 2nd stage at 0x1E00 and kernel at 0x2000 so that they are contiguous in memory and both can be loaded with a single disk read).
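    A 2nd stage whose ORG matches its load address might look like the sketch below. The GDT is copied from the question; the load address 0x1800 is an assumption chosen so that the 2048-byte 2nd stage ends exactly where the kernel starts (0x2000), and it only works if the 1st stage really loads this file at 0x0000:0x1800 and jumps there with a far jump.

```nasm
; 2nd stage sketch. SECOND_STAGE_LOCATION is an assumed load address.
bits 16

SECOND_STAGE_LOCATION equ 0x1800    ; assumed: must match what the 1st stage uses
KERNEL_LOCATION       equ 0x2000    ; from the question's comments

org SECOND_STAGE_LOCATION           ; labels now equal physical addresses

stage2_start:
    cli                             ; no real-mode interrupts from here on
    xor ax, ax
    mov ds, ax                      ; lgdt reads gdt_descriptor through DS
    lgdt [gdt_descriptor]
    mov eax, cr0
    or al, 1                        ; set CR0.PE
    mov cr0, eax
    jmp CODE_SEG:start_protected_mode ; loads CS with the flat 32-bit selector

gdt_start:
gdt_null:                           ; mandatory null descriptor
    dd 0x0
    dd 0x0
gdt_code:                           ; base 0, limit 4 GiB, 32-bit code
    dw 0xffff
    dw 0x0
    db 0x0
    db 10011010b
    db 11001111b
    db 0x0
gdt_data:                           ; base 0, limit 4 GiB, 32-bit data
    dw 0xffff
    dw 0x0
    db 0x0
    db 10010010b
    db 11001111b
    db 0x0
gdt_end:

gdt_descriptor:
    dw gdt_end - gdt_start - 1      ; limit (size of the GDT minus 1)
    dd gdt_start                    ; linear address, correct because of org

CODE_SEG equ gdt_code - gdt_start
DATA_SEG equ gdt_data - gdt_start

bits 32
start_protected_mode:
    mov ax, DATA_SEG
    mov ds, ax
    mov ss, ax
    mov es, ax
    mov fs, ax
    mov gs, ax

    mov ebp, 0x90000                ; 32-bit stack, as in the question
    mov esp, ebp

    jmp KERNEL_LOCATION             ; kernel assumed to be loaded at 0x2000

times 2048 - ($ - $$) db 0          ; pad to 4 sectors, as in the question
```

    Because the segment bases are zero, every label used after the switch (the far jump target and the GDT base) has to be a physical address, which is what the org line provides here.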

  • Notes

    1. It'd be a good idea to have some kind of "/src/doc/boot_phys_mem.txt" document that describes where everything (stack, 1st stage, 2nd stage, kernel, ...) is supposed to go in physical memory during boot, so it's easier to avoid bugs by making sure the documented goal is sane, then making sure the code matches the documented goal.

    2. There are lots of tutorials written by clueless beginners who wrote a dodgy mess full of bugs, then wanted to invent a practical purpose for their dodgy mess full of bugs, so they publish their dodgy mess full of bugs as a tutorial. Then more clueless beginners come along and "learn" to create a dodgy mess full of bugs from the tutorials written by clueless beginners. If you've been using any kind of tutorial, throw that tutorial in the trash where it probably belongs and try to find a better tutorial.

    3. NASM's directives have a "user level form" that you're supposed to use for almost everything (e.g. org 0x0000 and bits 32). There's also a lower level "primitive form" that is supposed to be used for tricky (sometimes internal) macros that need to bypass normal/extra behaviour (e.g. [org 0x0000] and [bits 32]). For org and bits there's currently no difference, but you should use the user level form because a future version of NASM could introduce differences that break your code.

    4. For hard disks you need to deal with partitions and partition tables, and for floppy disks you need a BPB ( https://en.wikipedia.org/wiki/BIOS_parameter_block ). You have neither. Having neither is not compatible with reality.

    5. Eventually you will need multiple different boot loaders: one for BIOS hard disk, one for BIOS floppy, one for "no emulation CD-ROM", one for network boot (then one or more for UEFI). For BIOS, it's a good idea to put some thought into code re-use in the beginning, so that code that is the same for all boot devices (getting memory map, setting up video, checking kernel has a valid header and wasn't corrupted, ...) doesn't get duplicated for each "device specific" boot loader. It can also be a good idea to forget about the obsolete stuff to save time (e.g. maybe consider just supporting UEFI only). Stuff that nobody uses now (e.g. floppy disks) isn't going to become more relevant after the many years it's going to take for you to write your OS.

    6. It'd be a good idea to check if the CPU meets your minimum requirements before you assume the CPU supports protected mode (and before the code crashes without any kind of "your CPU is unsupported" error message).
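    For the CPU check in note 6, one commonly used approach is to test how bits 12-15 of FLAGS behave, since that differs between the 8086/80186, the 80286 and later CPUs. The routine below is a rough sketch of that idea, assuming the boot code needs an 80386 or later (the question's 2nd stage uses 32-bit registers and a 32-bit code segment). The name check_386 and the "carry set means unsupported" convention are made up for this example.

```nasm
; CPU check sketch. check_386 and its return convention are hypothetical.
bits 16

; Returns with carry clear on an 80386 or later, carry set otherwise.
; Clobbers AX and CX. Uses only instructions that exist on an 8086.
check_386:
    pushf                   ; save the caller's FLAGS for later
    pushf
    pop ax
    mov cx, ax              ; CX = copy of the original FLAGS

    and ax, 0x0FFF          ; try to clear bits 12-15
    push ax
    popf
    pushf
    pop ax
    and ax, 0xF000
    cmp ax, 0xF000          ; 8086/80186 always read these bits as 1
    je .too_old

    or cx, 0xF000           ; try to set bits 12-15
    push cx
    popf
    pushf
    pop ax
    test ax, 0xF000         ; an 80286 in real mode always reads them as 0
    jz .too_old

    popf                    ; restore the caller's FLAGS
    clc                     ; 80386 or later
    ret

.too_old:
    popf                    ; restore the caller's FLAGS
    stc                     ; CPU is older than an 80386
    ret
```

    The 1st stage could call this before touching any 32-bit register and print a "your CPU is unsupported" message when it returns with carry set.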