
MSDN Blogs: Linux Recovery: Manually fixing non-boot issues related to Kernel problems


You have a Linux VM that recently had kernel changes applied, such as a kernel upgrade, and it no longer starts up properly due to kernel errors during the boot process.

Kernel messages will vary; some examples could be:

  • no root device found
  • kernel timeouts
  • null pointer dereferences
  • kernel panic errors

Most of the time the recovery steps are similar: you either install a newer kernel or manually roll back to the previous version.

Examples of error messages you may see

A) No root device found

dracut Warning: No root device "block:/dev/disk/by-uuid/6d089360-3e14-401d-91d0-378f3fd09332" found
dracut Warning: Boot has failed. To debug this issue add "rdshell" to the kernel command line.
dracut Warning: Signal caught!
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.32-504.12.2.el6.x86_64 #1
Call Trace:
[<ffffffff8152933c>] ? panic+0xa7/0x16f
[<ffffffff8107a5f2>] ? do_exit+0x862/0x870
[<ffffffff8118fa25>] ? fput+0x25/0x30
[<ffffffff8107a658>] ? do_group_exit+0x58/0xd0
[<ffffffff8107a6e7>] ? sys_exit_group+0x17/0x20
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

B) Example of a kernel timeout error

INFO: task swapper:1 blocked for more than 120 seconds.
Not tainted 2.6.32-504.8.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swapper       D 0000000000000000     0     1      0 0x00000000
ffff88010f64fde0 0000000000000046 ffff88010f64fd50 ffffffff81074f95
0000000000005c2f ffffffff8100bb8e ffff88010f64fe50 0000000000100000
0000000000000002 00000000fffb73e0 ffff88010f64dab8 ffff88010f64ffd8
Call Trace:
[<ffffffff81074f95>] ? __call_console_drivers+0x75/0x90
[<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff81075d51>] ? vprintk+0x251/0x560
[<ffffffff8152a862>] schedule_timeout+0x192/0x2e0
[<ffffffff810874f0>] ? process_timeout+0x0/0x10
[<ffffffff8152a9ce>] schedule_timeout_uninterruptible+0x1e/0x20
[<ffffffff81089650>] msleep+0x20/0x30
[<ffffffff81c2a571>] prepare_namespace+0x30/0x1a9
[<ffffffff81c2992a>] kernel_init+0x2e1/0x2f7
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff81c29649>] ? kernel_init+0x0/0x2f7
[<ffffffff8100c200>] ? child_rip+0x0/0x20

C) Example of a kernel null pointer error

Pid: 242, comm: async/1 Not tainted 2.6.32-504.12.2.el6.x86_64 #1
Call Trace:
[<ffffffff81177468>] ? kmem_cache_create+0x538/0x5a0
[<ffffffff8152aede>] ? mutex_lock+0x1e/0x50
[<ffffffff81370424>] ? attribute_container_add_device+0x104/0x150
[<ffffffffa009c1de>] ? storvsc_device_alloc+0x4e/0xa0 [hv_storvsc]
[<ffffffff8138a1dc>] ? scsi_alloc_sdev+0x1fc/0x280
[<ffffffff8138a739>] ? scsi_probe_and_add_lun+0x4d9/0xe10
[<ffffffff8128e62d>] ? kobject_set_name_vargs+0x6d/0x70
[<ffffffff8152aede>] ? mutex_lock+0x1e/0x50
[<ffffffff81370424>] ? attribute_container_add_device+0x104/0x150
[<ffffffff81367ae9>] ? get_device+0x19/0x20
[<ffffffff8138b440>] ? scsi_alloc_target+0x2d0/0x300
[<ffffffff8138b661>] ? __scsi_scan_target+0x121/0x740
[<ffffffff8138bd07>] ? scsi_scan_channel+0x87/0xb0
[<ffffffff8138bde0>] ? scsi_scan_host_selected+0xb0/0x190
[<ffffffff8138bf51>] ? do_scsi_scan_host+0x91/0xa0
[<ffffffff8138c13c>] ? do_scan_async+0x1c/0x150
[<ffffffff810a7086>] ? async_thread+0x116/0x2e0
[<ffffffff81064b90>] ? default_wake_function+0x0/0x20
[<ffffffff810a6f70>] ? async_thread+0x0/0x2e0
[<ffffffff8109e66e>] ? kthread+0x9e/0xc0
[<ffffffff8100c20a>] ? child_rip+0xa/0x20
[<ffffffff8109e5d0>] ? kthread+0x0/0xc0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffffa009c0a0>] storvsc_device_destroy+0x20/0x50 [hv_storvsc]
PGD 0

D) Example of a Kernel Panic error

invalid opcode: 0000 [#2]
[11427.908676] ---[ end trace 61a458bb863d7f0f ]---
Kernel panic - not syncing: Attempted to kill the idle task!

Recovering from these issues

In all cases we will need to go through the normal procedure: delete the affected Linux VM while keeping its OS disk, attach that disk to a new VM running the same OS version as the impacted VM (or at least the same distribution), and then, depending on the issue, use one of the two repair options described below.

1) Simply edit configuration files to roll back the kernel and boot from a previous working setup. Linux boot loaders usually have more than one entry defining which kernel to boot, and that list usually gets updated every time you perform an upgrade so that it references the newly installed kernel.

2) Go through a CHROOT process: attach the impacted VM's OS disk to a temporary new VM and use the known tools (apt-get/yum/zypper) to install or reinstall a kernel. This is the easiest and usually fastest way to repair, since you do not have to manually edit files.
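For illustration only, here is a minimal sketch of the CHROOT approach on a CentOS-based recovery VM, assuming the damaged OS disk is already mounted at /rescue (the mount steps themselves are covered later in this article):

# Bind the runtime filesystems the package manager needs into the mounted OS disk
mount --bind /dev  /rescue/dev
mount --bind /proc /rescue/proc
mount --bind /sys  /rescue/sys
# Enter the broken system and reinstall the kernel with its own package manager
chroot /rescue
yum reinstall kernel    # or apt-get/zypper on other distributions
exit
# Clean up the bind mounts after leaving the chroot
umount /rescue/dev /rescue/proc /rescue/sys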

For example, a Linux VM running CentOS 6.6 has loaded kernel version 2.6.32-504.16.2.el6.x86_64; this can be seen with the command:
uname -a

Linux vfldev 2.6.32-504.16.2.el6.x86_64 #1 SMP Wed Apr 22 06:48:29 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

The boot loader configuration file defines which kernel version to load; in the case of CentOS the file is /boot/grub/grub.conf, and on newer versions it is /boot/grub2/grub.cfg.
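For a quick list of the entries a legacy GRUB configuration defines, you can grep for the title lines (the menuentry lines are the GRUB2 equivalent):

grep "^title" /boot/grub/grub.conf
grep "^menuentry" /boot/grub2/grub.cfg    # GRUB2, on newer versions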

Looking closer at grub.conf, we can see that the currently loaded version is referenced in the first four lines of each entry:

1) Title: the title that shows up in the boot menu (not applicable in cloud environments)
2) The root disk
3) The kernel being loaded at boot time and its command-line parameters
4) The path of the initrd file that will be loaded at boot, which usually matches the kernel version

An example of grub.conf with reference to 4 Linux versions:

title CentOS (2.6.32-504.16.2.el6.x86_64)
root (hd0,0)
kernel /boot/vmlinuz-2.6.32-504.16.2.el6.x86_64 ro root=UUID=6d089360-3e14-401d-91d0-378f3fd09332 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM numa=off console=ttyS0 earlyprintk=ttyS0 rootdelay=300 crashkernel=auto
initrd /boot/initramfs-2.6.32-504.16.2.el6.x86_64.img

title CentOS (2.6.32-504.12.2.el6.x86_64)
root (hd0,0)
kernel /boot/vmlinuz-2.6.32-504.12.2.el6.x86_64 ro root=UUID=6d089360-3e14-401d-91d0-378f3fd09332 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM numa=off console=ttyS0 earlyprintk=ttyS0 rootdelay=300 crashkernel=auto
initrd /boot/initramfs-2.6.32-504.12.2.el6.x86_64.img

title CentOS (2.6.32-504.8.1.el6.x86_64)
root (hd0,0)
kernel /boot/vmlinuz-2.6.32-504.8.1.el6.x86_64 ro root=UUID=6d089360-3e14-401d-91d0-378f3fd09332 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM numa=off console=ttyS0 earlyprintk=ttyS0 rootdelay=300 crashkernel=auto
initrd /boot/initramfs-2.6.32-504.8.1.el6.x86_64.img

title CentOS (2.6.32-431.29.2.el6.x86_64)
root (hd0,0)
kernel /boot/vmlinuz-2.6.32-431.29.2.el6.x86_64 ro root=UUID=6d089360-3e14-401d-91d0-378f3fd09332 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM numa=off console=ttyS0 earlyprintk=ttyS0 rootdelay=300
initrd /boot/initramfs-2.6.32-431.29.2.el6.x86_64.img

To change the boot loader configuration (grub.conf) and force the Linux VM to load a different kernel, manual intervention is required; below you can find the steps to make that change.

NOTE: We highly recommend making a backup of the VHD from the inaccessible VM before going through the recovery steps. You can make a backup of the VHD by using Microsoft Storage Explorer, available at http://storageexplorer.com

A = Original VM (Inaccessible VM)
B = New VM (New Recovery VM)

  1. Stop the VM via the Azure Portal
  2. For a Resource Manager VM, we recommend saving the current VM information before deleting it:
    • Azure CLI: azure vm show ResourceGroupName LinuxVmName > ORIGINAL_VM.txt
    • Azure PowerShell: Get-AzureRmVM -ResourceGroupName $rgName -Name $vmName
  3. Delete the VM, but select "keep the attached disks".
    NOTE: The option to keep the attached disks is only available for classic deployments; for Resource Manager, deleting a VM always keeps its OS disk by default.
  4. Once the lease is cleared, attach the OS disk from VM "A" as a data disk to VM "B" via the Azure Portal: Virtual Machines, select "B", Attach Disk.
  5. On VM "B" the disk will eventually attach, and you can then mount it.
  6. Locate the device name to mount; on VM "B" look in the relevant log file (note that each Linux distribution is slightly different):
    • grep SCSI /var/log/kern.log (Ubuntu, Debian)
    • grep SCSI /var/log/messages (CentOS, SUSE, Oracle, Red Hat)
  7. Mount the attached disk onto mountpoint /rescue
    df -h
    mkdir /rescue
    For Red Hat 7.2+
    mount -o nouuid /dev/sdc2 /rescue
    For CentOS 7.2+
    mount -o nouuid /dev/sdc1 /rescue
    For Debian 8.2+, Ubuntu 16.04+, SUSE 12 SP4+
    mount /dev/sdc1 /rescue
  8. With the OS disk mounted on /rescue on VM "B", modify the /rescue/boot/grub/grub.conf file. Modified grub.conf, with the entry for the failing kernel commented out so that the next entry boots:

#title CentOS (2.6.32-504.16.2.el6.x86_64)
#root (hd0,0)
#kernel /boot/vmlinuz-2.6.32-504.16.2.el6.x86_64 ro root=UUID=6d089360-3e14-401d-91d0-378f3fd09332 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM numa=off console=ttyS0 earlyprintk=ttyS0 rootdelay=300 crashkernel=auto
#initrd /boot/initramfs-2.6.32-504.16.2.el6.x86_64.img
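Alternatively, if your grub.conf contains a default= directive (an assumption; the excerpt above does not show one), legacy GRUB lets you boot a different entry without commenting anything out:

# Boot the second entry (index 1) instead of the first (index 0)
sed -i 's/^default=0/default=1/' /rescue/boot/grub/grub.conf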

Another item to check is whether the device specified by the UUID value in the boot loader file (grub.conf) exists.

Taking our example of the OS disk mounted on /rescue, look in /rescue/dev/disk/by-uuid:
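For example (the blkid check is an addition here and assumes the attached partition is /dev/sdc1):

ls -l /rescue/dev/disk/by-uuid
blkid /dev/sdc1    # confirms the filesystem UUID directly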

We can verify that there is a corresponding UUID entry on disk matching the one referenced in the grub.conf file.

The entry is actually a symbolic link, as denoted by the l at the start of the attributes (lrwxrwxrwx), and it points to the OS disk partition sda1.

Even if this link is missing, testing shows that a deleted symbolic link is recreated when the system starts up.

It is possible to create the symbolic link manually; you would need to know that sda1 is your boot device and its corresponding UUID:

cd /rescue/dev/disk/by-uuid
ln -s ../../sda1 6d089360-3e14-401d-91d0-378f3fd09332

The sda1 file is known as a block device in Linux, as seen in the ls output and denoted by a b in the attributes (brw-rw----), or by running the file command.

You can also check that this file is present; again, testing on CentOS 6.5 shows that this file is also recreated if missing.

cd /rescue/dev/disk/by-uuid
ls -ltr ../../sda1
brw-rw---- 1 root disk 8, 1 Apr 30 14:37 ../../sda1

file ../../sda1
../../sda1: block special

2) If the VM still does not boot after you have tried booting from a previous kernel, you can next try to rebuild the initramfs and copy over a new compressed image of the Linux kernel; both are referenced in the example grub file:

Initramfs file
initrd /boot/initramfs-2.6.32-504.16.2.el6.x86_64.img

Kernel file
/boot/vmlinuz-2.6.32-504.16.2.el6.x86_64

Normally, in an on-premises environment you would boot the system from a recovery CD. In cloud environments we have to attach the OS disk to a temporary VM of the same OS and version to recover or manipulate system files in a no-boot scenario; this is even more important if we are going to attempt to copy or recreate the initramfs and kernel files.

Once the OS disk is attached to the temporary VM and mounted on /rescue, first secure any data by copying it off the OS disk.
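For example, an illustrative copy of critical directories off the mounted disk; the destination path is an assumption:

mkdir -p /backup
cp -a /rescue/etc /rescue/home /backup/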

Revert any previously made changes to grub.conf: if you had commented out the entries referring to the first kernel to boot, reinstate them.

Then proceed to rebuild the initramfs:

mv /rescue/boot/initramfs-2.6.32-504.8.1.el6.x86_64.img  /rescue/boot/initramfs-2.6.32-504.8.1.el6.x86_64.old-img

On the temporary CentOS 6.6 Linux VM we were unable to locate the exact same initramfs version, hence building and using the latest version available on the temporary VM.

dracut /rescue/boot/initramfs-2.6.32-504.8.1.el6.x86_64.img  2.6.32-504.8.1.el6.x86_64

Check which kernel module trees are available on the temporary VM, rebuild the initramfs for one of them, then to be safe copy the matching vmlinuz file, and finally update grub.conf to reflect the new kernel values:
ls -ltr /lib/modules/

drwxr-xr-x. 7 root root 4096 Apr 14  2014 2.6.32-431.11.2.el6.x86_64
drwxr-xr-x. 7 root root 4096 Jun  4  2014 2.6.32-431.17.1.el6.x86_64
drwxr-xr-x. 7 root root 4096 Sep 25  2014 2.6.32-431.29.2.el6.x86_64
drwxr-xr-x. 7 root root 4096 Nov 18 18:54 2.6.32-504.1.3.el6.x86_64
drwxr-xr-x. 7 root root 4096 Mar 25 19:28 2.6.32-504.12.2.el6.x86_64

dracut /rescue/boot/initramfs-2.6.32-504.12.2.el6.x86_64.img 2.6.32-504.12.2.el6.x86_64

ls -ltr /rescue/boot/initramfs-2.6.32-504.12.2.el6.x86_64.img
-rw-------. 1 root root 19354168 May  6 15:39 /rescue/boot/initramfs-2.6.32-504.12.2.el6.x86_64.img
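Optionally, you can sanity-check the contents of the rebuilt image with lsinitrd, which ships with dracut:

lsinitrd /rescue/boot/initramfs-2.6.32-504.12.2.el6.x86_64.img | head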

cp /boot/vmlinuz-2.6.32-504.12.2.el6.x86_64 /rescue/boot/

ls -ltr /rescue/boot/vmlinuz*

-rwxr-xr-x. 1 root root 4128368 Nov 22  2013 /rescue/boot/vmlinuz-2.6.32-431.el6.x86_64
-rwxr-xr-x. 1 root root 4128688 Jan  3  2014 /rescue/boot/vmlinuz-2.6.32-431.3.1.el6.x86_64
-rwxr-xr-x. 1 root root 4129872 May  7  2014 /rescue/boot/vmlinuz-2.6.32-431.17.1.el6.x86_64
-rwxr-xr-x. 1 root root 4131984 Sep  9  2014 /rescue/boot/vmlinuz-2.6.32-431.29.2.el6.x86_64
-rwxr-xr-x. 1 root root 4153008 Jan 28 21:40 /rescue/boot/vmlinuz-2.6.32-504.8.1.el6.x86_64
-rwxr-xr-x. 1 root root 4152720 May  6 15:44 /rescue/boot/vmlinuz-2.6.32-504.12.2.el6.x86_64

vi /rescue/boot/grub/grub.conf

title CentOS (2.6.32-504.12.2.el6.x86_64)
root (hd0,0)
kernel /boot/vmlinuz-2.6.32-504.12.2.el6.x86_64 ro root=UUID=6d089360-3e14-401d-91d0-378f3fd09332 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet numa=off console=ttyS0 earlyprintk=ttyS0 rootdelay=300
initrd /boot/initramfs-2.6.32-504.12.2.el6.x86_64.img

cd /
umount /rescue

  1. Detach the disk from VM B via the Azure portal
  2. Recreate the original VM A from the repaired VHD

For a Classic VM:

Recreate the original VM A (Create VM from Gallery, select My Disks); you will see the disk referring to VM A. Select the original Cloud Service name.

For a Resource Manager VM you will need to use either PowerShell or the Azure CLI tools; the articles below have steps to recreate a VM from its original VHD:

Azure PowerShell: How to delete and re-deploy a VM from VHD
Azure CLI: How to delete and re-deploy a VM from VHD

 

