Tags: azure, virtual-machine, hyper-v, uefi, azure-vm

Boot error after migrating 2nd-Gen Hyper-V hosted Gentoo Linux VM to Azure


My core task is to move a small but important Gentoo Linux VM, currently running on Hyper-V, onto Azure.

Ideally, the application it's running will eventually be migrated to a different hosting model or a Linux distro that is better supported on Azure. But due to various constraints, we're hoping this intermediate step will let us move it, as is, to Azure.

After preparing the VHD, moving the image to Azure, and provisioning a VM for it, the VM will not boot. All I can see through the boot diagnostics is this: [screenshot: UEFI boot-failure screen]

In an attempt to understand where I'm going wrong, I've tried to spin up a new local VM using the VHD prepared for Azure as its disk (covered in (3)). When doing this, I get the same error as the one depicted above.

I'm unsure how to progress from here and would appreciate any feedback. Details about how I have prepared the Gentoo VM, the VHD, and the Azure VM follow below.

  1. Basic info about the Gentoo VM

  • It's kept up to date and is running Linux kernel 5.10.52 with Gentoo patches.
  • The VM runs as a "Generation 2" VM on Hyper-V, meaning, among other things, that it's UEFI-based.
  • The VM runs off a single disk.
  2. Preparing the Gentoo VM

In order to prepare the Gentoo image, I've mainly referred to these pieces of documentation:

This means making sure all the Hyper-V-specific kernel requirements are met. The only thing we don't do, to my knowledge, is run waagent.
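Those kernel requirements can be sanity-checked mechanically. A minimal sketch follows; the option names are the Hyper-V ones Azure's Linux requirements call out, and the inline sample config is only there to make the check self-contained (in practice you would point it at /usr/src/linux/.config, or at zcat /proc/config.gz for the running kernel):

```shell
# Verify that a kernel config enables the Hyper-V options Azure expects.
# The inline sample config is a stand-in for /usr/src/linux/.config.
config=$(mktemp)
cat > "$config" <<'EOF'
CONFIG_HYPERV=y
CONFIG_HYPERV_STORAGE=y
CONFIG_HYPERV_NET=y
EOF

report=""
for opt in CONFIG_HYPERV CONFIG_HYPERV_STORAGE CONFIG_HYPERV_NET \
           CONFIG_HYPERV_UTILS CONFIG_HYPERV_BALLOON; do
    # Accept either built-in (=y) or module (=m).
    if grep -q "^${opt}=[ym]" "$config"; then
        report="${report}ok:      ${opt}\n"
    else
        report="${report}missing: ${opt}\n"
    fi
done
printf "%b" "$report"
rm -f "$config"
```

Against the sample config this flags CONFIG_HYPERV_UTILS and CONFIG_HYPERV_BALLOON as missing.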

In order to prepare the VM for Azure, the VM is shut down. We use the Export function to export a copy of the VM to the filesystem.

  3. Preparing the VHD(X)

From there, we convert the VHDX to a Fixed Size VHD, per the requirements. This is done in PowerShell through the Hyper-V CmdLets:

Convert-VHD -Path .\Gentoo.vhdx -DestinationPath .\Gentoo-Fixed.vhd -VHDType:Fixed
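One thing worth double-checking after the conversion: Azure requires the VHD's virtual size to be a whole number of MiB, and Convert-VHD keeps whatever size the source VHDX had. The round-up itself is plain arithmetic, sketched below with a made-up misaligned size; the actual resize would be done with the Hyper-V Resize-VHD cmdlet before uploading:

```shell
# Round a disk size up to the next 1 MiB boundary, as Azure requires.
# 4 GiB + 137 bytes is a made-up example of a misaligned virtual size.
size=$(( 4096 * 1024 * 1024 + 137 ))
mib=$(( 1024 * 1024 ))
aligned=$(( (size + mib - 1) / mib * mib ))
echo "$aligned"              # next MiB boundary above $size
echo $(( aligned % mib ))    # 0 once aligned
```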

After this, I basically run through this script in PowerShell to:

  • Create a new (Hyper-V Generation 2) managed disk
  • Create a shared access signature (SAS) for it
  • Upload my VHD via AzCopy
  • Revoke the SAS
  • Create a (Hyper-V Generation 2) image from the disk
$azRegion  = 'northeurope'  # Geographical location
$diskName  = 'gentoo-sda'   # Name of the disk
$imageName = 'GentooAzure'
$rgName    = 'gentoo-host'  # Name of the resource group
$vhdSize   = (Get-Item .\Gentoo-Fixed.vhd).length

$diskConfig = New-AzDiskConfig -SkuName:Premium_LRS -OsType:Linux -HyperVGeneration:V2 -UploadSizeInBytes:$vhdSize -Location:$azRegion -CreateOption:'Upload'
New-AzDisk -ResourceGroupName:$rgName -DiskName:$diskName -Disk:$diskConfig

$disk = Get-AzDisk -ResourceGroupName:$rgName -DiskName:$diskName

#At this point $disk.DiskState should return "ReadyToUpload"


# Create a writeable shared-access-signature
$diskSAS = Grant-AzDiskAccess -ResourceGroupName:$rgName -DiskName:$diskName -DurationInSecond:86400 -Access:'Write'
$disk = Get-AzDisk -ResourceGroupName:$rgName -DiskName:$diskName

#At this point $disk.DiskState should return "ActiveUpload"

#Use AzCopy to upload the VHD
.\azcopy.exe copy ".\Gentoo-Fixed.vhd" $diskSAS.AccessSAS --blob-type PageBlob

#After the upload has been completed, revoke the SAS:
Revoke-AzDiskAccess -ResourceGroupName:$rgName -DiskName:$diskName

#Create Image from the managed disk
$imageConfig = New-AzImageConfig -Location:$azRegion -HyperVGeneration:V2
$imageConfig = Set-AzImageOsDisk -Image:$imageConfig -OsState:Generalized -OsType:Linux -ManagedDiskId:$disk.Id
$image = New-AzImage -ImageName:$imageName -ResourceGroupName:$rgName -Image:$imageConfig

  4. Creating the VM

From this point on, I've been attempting different ways to get a VM up and running. My two main methods have been:

  • Create a VM and attach the managed disk as the OS disk
  • Create a VM based on the Image

Both seem to end in a non-booting machine which presents the UEFI error message posted above.

  5. Debugging

As mentioned, in an attempt to pinpoint exactly where in my process I'm failing, I've taken the fixed VHD I end up with in step (3) and mounted it to a new VM on our on-prem Hyper-V. This results in the same error I see on Azure.

From this point on, I'm a bit unsure about how to approach this problem. Looking at the differences between the original VM, which works, and the non-functioning one created from the exported VHD, this particularly catches my eye:

These are the settings listed for the functioning Gentoo VM: [screenshot: firmware/boot settings of the working VM]

While these are the settings from the non-functioning VM set up with the exported VHD: [screenshot: firmware/boot settings of the broken VM]

It seems that some crucial boot settings disappear when converting the VHDX to a fixed-size VHD. But at this point, I'm not sure how to approach the problem.

Looking forward to your comments.


Solution

  • I never managed to figure out exactly what was happening. However, with help from the Gentoo folks on Discourse, we concluded that the /boot partition was mangled during the transition from the on-premises VM to Azure.

    (I suspect that this is what has happened every single time, no matter what transfer method I've been using.)

    I took a snapshot of the disk from the non-functional VM on Azure and mounted it to a newly created VM. From there, I mounted the /boot partition and found that the /boot/EFI folder contained only a single folder called gentoo, which in turn contained a grubx64.efi file.

    From there, I did a mkdir /boot/EFI/BOOT and cp /boot/EFI/gentoo/grubx64.efi /boot/EFI/BOOT/BOOTX64.EFI.

    After this, I unmounted the /boot partition and mounted this fixed drive back as the OS Disk of the VM.

    Success. It now boots!

    After logging in, I re-did the grub-install --target=x86_64-efi --efi-directory=/boot from the Gentoo Handbook.

    If anyone ever finds any documentation on why or how the transition from on-premises to Azure mangles the boot partition, do let me know. Nevertheless, the above seems to fix it.
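For reference, the fix above amounts to placing GRUB at the removable-media fallback path (\EFI\BOOT\BOOTX64.EFI), which UEFI firmware tries when it has no NVRAM boot entry pointing at \EFI\gentoo\grubx64.efi. A freshly created VM inherits none of the original machine's NVRAM boot entries, which would explain the symptom. Below is a sketch of the resulting ESP layout, recreated in a scratch directory so it is self-contained; $esp stands in for the real, mounted /boot partition from the post:

```shell
# Recreate the fix in a scratch directory; $esp stands in for the
# mounted /boot partition of the broken VM's disk.
esp=$(mktemp -d)
mkdir -p "$esp/EFI/gentoo"
printf 'grub' > "$esp/EFI/gentoo/grubx64.efi"   # stand-in for the real binary

# The fix: copy GRUB to the fallback path that firmware always checks
# when no NVRAM boot entry applies.
mkdir -p "$esp/EFI/BOOT"
cp "$esp/EFI/gentoo/grubx64.efi" "$esp/EFI/BOOT/BOOTX64.EFI"

ls "$esp/EFI/BOOT"
```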