Serj Babayan

Analyzing kernel panics in NixOS

Recently I got into self-hosting. I took out my old Ryzen gaming PC and decided to repurpose it as a server, since I don't have as much time for gaming as I'd like.

An important part of self-hosting, or any kind of production deployment, is keeping your host up.

For some unknown reason, after going idle or to sleep, my host would become unresponsive. I couldn't SSH in, and mouse and keyboard input stopped working.

There are many different ways that your computer can freeze up, from minor to really bad:

  1. Application stuttering - usually out of CPU

  2. Application freezing/becoming unresponsive - usually out of memory

  3. Your desktop environment freezing but you can SSH into the host - some graphics interaction issue

  4. Your desktop environment freezing and you can't SSH into the host - kernel panic

  5. Everything freezing so badly that your normal kernel panic triage doesn't work - hardware issue

In this post, I'll go over what you can do in case you have a kernel panic, step by step, from easy to more and more involved, so you're able to diagnose what's going wrong on your machine.

There are a million and one ways to debug issues like this, but I found the resources to be too scattered around. Hopefully the 20hr+ I spent on this issue helps someone else spend less time.

I'll provide non-Nix instructions so you can follow along even if you don't use it.

1. Stopping reboots on panic

The point of this is to give the kernel the best chance of printing the panic logs to your monitor. If you don't have a monitor connected to the system, you can skip ahead to the later steps, though I highly suggest trying this first.

Note that you should use kernel boot parameters if your panic is occurring during the boot stage. If it's occurring after boot, it's fine to use sysctl if you don't want to persist anything.

  # Show panic on console and don't reboot automatically
  boot.kernelParams = [
    "panic=0"              # Don't auto-reboot on panic (0 = never)
    "oops=panic"           # Treat oops as panics
    "softlockup_panic=1"   # Panic on soft lockups
    "loglevel=8"           # Lower this if there's too much noise on screen
    "debug"                # Enable general kernel debugging
  ];

  boot.kernel.sysctl = {
    "kernel.hung_task_panic" = 1;
    "kernel.hardlockup_panic" = 1;
    "kernel.panic_on_rcu_stall" = 1;
  };

Or the sysctl equivalent for the current boot only (run these as root):

sysctl kernel.panic=0
sysctl kernel.panic_on_oops=1
sysctl kernel.softlockup_panic=1
sysctl kernel.hung_task_panic=1
sysctl kernel.hardlockup_panic=1
sysctl kernel.panic_on_rcu_stall=1
sysctl kernel.printk="8 4 1 7"

And you can verify these with:

sysctl kernel.panic kernel.panic_on_oops kernel.softlockup_panic \
       kernel.hung_task_panic kernel.hardlockup_panic \
       kernel.panic_on_rcu_stall kernel.printk
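If you'd rather not eyeball the output, here's a minimal sketch that compares each setting against the values used above. check_sysctl and check_panic_sysctls are hypothetical helper names of mine, and the script reads the files under /proc/sys directly (kernel.panic lives at /proc/sys/kernel/panic) instead of calling sysctl, which should give the same values.

```shell
#!/bin/sh
# Print OK, MISMATCH, or MISSING for one sysctl key.
# Arguments: key, expected value, optional root directory (default /proc/sys).
check_sysctl() {
  local key="$1" want="$2" root="${3:-/proc/sys}"
  local path actual
  # kernel.panic -> /proc/sys/kernel/panic
  path="$root/$(printf '%s' "$key" | tr . /)"
  if [ ! -r "$path" ]; then
    echo "MISSING $key"
    return 1
  fi
  # Collapse tabs and newlines so multi-field values like kernel.printk compare cleanly.
  actual="$(tr -s '[:space:]' ' ' < "$path" | sed 's/ *$//')"
  if [ "$actual" = "$want" ]; then
    echo "OK $key = $actual"
  else
    echo "MISMATCH $key = $actual (expected $want)"
    return 1
  fi
}

# Check every setting from this section; returns non-zero if any is off.
check_panic_sysctls() {
  local root="${1:-/proc/sys}" rc=0
  local key want
  while read -r key want; do
    check_sysctl "$key" "$want" "$root" || rc=1
  done <<'EOF'
kernel.panic 0
kernel.panic_on_oops 1
kernel.softlockup_panic 1
kernel.hung_task_panic 1
kernel.hardlockup_panic 1
kernel.panic_on_rcu_stall 1
kernel.printk 8 4 1 7
EOF
  return "$rc"
}
```

Run check_panic_sysctls after applying the settings. A MISSING line most likely means your kernel wasn't built with that particular detector, not that you made a typo.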

Now, the next time there's a kernel panic, hopefully you'll see something useful on screen.

2. Checking the logs

Linux kernel messages are stored in the kernel ring buffer. You can read the messages in this buffer with the dmesg command.

sudo dmesg

This ring buffer is stored in RAM, so when you restart your computer, it'll disappear.

Journalctl

NixOS uses systemd, which means you can use the journalctl command to view any relevant logs. Part of this system is a daemon (systemd-journald) that flushes messages from the kernel ring buffer onto your file system, allowing you to look at kernel logs from a previous (panicked) boot.

After a freeze, you can restart your PC (hard restart is ok), and run:

journalctl -b -1

The -b -1 tells it to look at the logs from the previous boot. This will be ALL logs, both from the kernel and userland. To see the logs from your current boot, use -b 0. Positive numbers count from the oldest boot still in the journal, so -b 1 is the first recorded boot, -b 2 the second, and so on. Run journalctl --list-boots to see which boots are available.

If it's too noisy, you can filter for kernel messages only with the -k flag.

journalctl -b -1 -k

Look at the last few lines of output to see if anything looks strange. There may not be much information on the kernel panic itself, as the Linux kernel will stop flushing to disk in case of a panic to prevent further data corruption.

However, try to replicate the panic as best you can to see if there's a common pattern. Using AI tools like ChatGPT can be helpful in parsing out the logs from multiple panics as well.
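To make comparing several boots less tedious, here's a rough sketch of a filter that keeps only the lines that usually show up around a panic or lockup. panic_hints is a hypothetical helper name, and the pattern list is my own guess at common markers, not an official set, so expect some false positives and misses.

```shell
#!/bin/sh
# Print only the kernel log lines that usually accompany a panic or lockup.
# Reads a log on stdin. The pattern list is an assumption; extend it as needed.
panic_hints() {
  grep -i -E 'kernel panic|oops|BUG:|call trace|soft lockup|hard lockup|rcu.*stall|blocked for more than|machine check'
}
```

You'd use it like this: journalctl -b -1 -k --no-pager | panic_hints | tail -n 40. Swap -1 for -2, -3, and so on to compare earlier boots.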

3. Enabling more verbose logs

In case the above logs weren't sufficient, try enabling more verbose logs and see if that gives you any more hints.

Nix way

In Nix, you can enable more verbose logs from the kernel with this:


  boot.consoleLogLevel = 8; # default 4. 8 should be more verbose

Non-Nix

If you don't want to do it declaratively, you can also apply only to your current boot:

sysctl kernel.printk="8 4 1 7"

Or, to make it persistent, pass loglevel=8 as a kernel parameter. How you do that depends on your OS and bootloader. In Nix it looks like this:

# this is what boot.consoleLogLevel does under the hood
boot.kernelParams = [
  "loglevel=8"
];

Then print your boot-time kernel parameters to confirm it was applied:

cat /proc/cmdline
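On a long command line it's easy to miss a parameter, so here's a small sketch that checks for an exact match. has_kernel_param is a hypothetical helper name, and it assumes parameters are separated by plain spaces, so a quoted value that contains spaces would trip it up.

```shell
#!/bin/sh
# Succeed if the exact parameter (e.g. loglevel=8) appears on the kernel command line.
# Arguments: parameter, optional file to read (default /proc/cmdline).
has_kernel_param() {
  local want="$1" file="${2:-/proc/cmdline}"
  # Put each space-separated parameter on its own line, then match whole lines literally.
  tr -s ' ' '\n' < "$file" | grep -q -x -F -e "$want"
}
```

For example: has_kernel_param loglevel=8 && echo "verbose logging set at boot".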

Verifying

Regardless of how you got this done, run this command to verify you have verbose logging enabled:

cat /proc/sys/kernel/printk
8       4       1       7
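The four numbers are, in order, the current console log level, the default level for messages that don't specify one, the minimum the console level can be set to, and the boot-time default console level. The kernel prints a message to the console only if its level is lower than the first number, and debug messages are level 7, which is why 8 shows everything. Here's a minimal sketch that labels the fields; describe_printk is a hypothetical helper name.

```shell
#!/bin/sh
# Label the four fields of kernel.printk.
# Argument: optional file to read (default /proc/sys/kernel/printk).
describe_printk() {
  local file="${1:-/proc/sys/kernel/printk}"
  local current default minimum boot_default
  read -r current default minimum boot_default < "$file"
  echo "console_loglevel=$current"
  echo "default_message_loglevel=$default"
  echo "minimum_console_loglevel=$minimum"
  echo "default_console_loglevel=$boot_default"
  # Messages are printed only when their level is below console_loglevel.
  if [ "$current" -ge 8 ]; then
    echo "console shows all messages, including debug"
  else
    echo "console only shows messages with a level below $current"
  fi
}
```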

4. Retrieving logs from the panic

If the logs leading up to the panic aren't sufficient, the next step is to try and retrieve any logs made during the panic.

By default, the Linux kernel will not flush to disk during a panic. You have three options when it comes to getting any logs:

  1. Using a serial cable
  2. Enabling netconsole
  3. Enabling kdump

Serial

Serial is a pretty simple protocol, and you can configure the Linux kernel to send all of its messages over it. Most computers don't have a serial port anymore, so you'll need a USB-to-USB null modem cable and a second PC set up to read the serial output. I didn't go for this because you can use an Ethernet connection to accomplish the same thing with netconsole.

Netconsole

TBD..