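1. Setting up an ARM64 Crash debugging environment

Main references:

1.1. Building the ARM64 Crash tool on an x86_64 host

The Crash tool installed on an x86 machine cannot directly debug ARM64 coredump files. You can rebuild Crash from source to produce a binary that can analyze ARM64 coredumps. Following the build instructions in the README at https://github.com/crash-utility/crash/tree/8.0.3, unpack the source and run make target=ARM64.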

  To build the crash utility:

    $ tar -xf crash-8.0.3.tar.gz
    $ cd crash-8.0.3
    $ make

  The initial build will take several minutes  because the embedded gdb module
  must be configured and built.  Alternatively, the crash source RPM file
  may be installed and built, and the resultant crash binary RPM file installed.

  The crash binary can only be used on systems of the same architecture as
  the host build system.  There are a few optional manners of building the
  crash binary:

  o  On an x86_64 host, a 32-bit x86 binary that can be used to analyze
     32-bit x86 dumpfiles may be built by typing "make target=X86".
  o  On an x86 or x86_64 host, a 32-bit x86 binary that can be used to analyze
     32-bit arm dumpfiles may be built by typing "make target=ARM".
  o  On an x86 or x86_64 host, a 32-bit x86 binary that can be used to analyze
     32-bit mips dumpfiles may be built by typing "make target=MIPS".
  o  On an ppc64 host, a 32-bit ppc binary that can be used to analyze
     32-bit ppc dumpfiles may be built by typing "make target=PPC".
  o  On an x86_64 host, an x86_64 binary that can be used to analyze
     arm64 dumpfiles may be built by typing "make target=ARM64".
  o  On an x86_64 host, an x86_64 binary that can be used to analyze
     ppc64le dumpfiles may be built by typing "make target=PPC64".
  o  On an x86_64 host, an x86_64 binary that can be used to analyze
     riscv64 dumpfiles may be built by typing "make target=RISCV64".
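
The README list above is effectively a lookup from dumpfile architecture to make argument. As a minimal sketch for an x86_64 build host, the helper below encodes that table; crash_make_target and the lowercase architecture names it accepts are hypothetical names chosen for illustration, not part of crash.

```shell
# Hypothetical helper (not part of crash): map a dumpfile architecture to the
# "make" argument listed in the crash 8.0.3 README for an x86_64 build host.
crash_make_target() {
    case $1 in
        x86_64)  echo "" ;;                 # native build, plain "make"
        x86)     echo "target=X86" ;;
        arm)     echo "target=ARM" ;;
        mips)    echo "target=MIPS" ;;
        arm64)   echo "target=ARM64" ;;
        ppc64le) echo "target=PPC64" ;;
        riscv64) echo "target=RISCV64" ;;
        *)       echo "unsupported dump architecture: $1" >&2; return 1 ;;
    esac
}
```

Inside the unpacked crash-8.0.3 directory this could be used as make $(crash_make_target arm64), which expands to make target=ARM64.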

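1.2. Installing the ARM64 Crash tool with Docker on an x86_64 host

Building from source means resolving the dependencies yourself, and the build is relatively slow, so Docker is used here instead. The rest of this article also uses this approach.

To install Docker, see the Docker CE repository mirror help page of the Tsinghua University Open Source Software Mirror Site (清华大学开源软件镜像站). Docker can be used directly after installation, or you can follow other guides to configure a registry mirror inside China. Once Docker is installed, run the following steps.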

# Install ARM64 runtime support
sudo apt-get install qemu-user-static binfmt-support
# Pull the arm64v8 Debian image
docker pull arm64v8/debian:bookworm-20230814
# Start the container and mount a host directory; Crash needs the vmlinux file when debugging
docker run -it --privileged -v /storage/data:/data arm64v8/debian:bookworm-20230814 bash

Once the container is running, install the Crash tool and some helper utilities inside it.

apt update && apt install crash vim less

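After installation, run docker commit [OPTIONS] CONTAINER [REPOSITORY[:TAG]] on the host to save the container as an image; the host can later start containers from that image.

Inside the container, crash --buildinfo shows which architecture the Crash tool supports; here build_target is ARM64.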

crash --buildinfo
   build_command: crash
      build_data: reproducible build
    build_target: ARM64
   build_version: 8.0.2
compiler version: gcc (Debian 12.2.0-9) 12.2.0
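
When the check needs to be scripted, the build_target field can be extracted from that output. The following is a minimal sketch that assumes the field layout shown above; crash_build_target is a hypothetical helper name.

```shell
# Hypothetical helper: read "crash --buildinfo" output on stdin and print
# the value of the build_target field.
crash_build_target() {
    awk -F': *' '$1 ~ /^[ \t]*build_target$/ { print $2 }'
}
```

Inside the container, crash --buildinfo | crash_build_target should then print the target architecture, which can be compared against ARM64 before trying to open a dump.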

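2. Build configuration

Build details are not covered here, only the configuration the build requires.

2.1. Kernel configuration and build

Building the ARM64 kernel requires the following options. The documentation does not mention CONFIG_RANDOMIZE_BASE or CONFIG_PROC_KCORE, but both are needed as well, and the cmdline passed to the kernel must not contain nokaslr.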

CONFIG_KEXEC=y
CONFIG_SYSFS=y
CONFIG_DEBUG_INFO=y
CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y
CONFIG_CRASH_DUMP=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
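
Before starting a long kernel build, it can help to confirm that the generated .config really contains these options. The function below is a minimal sketch of such a check; check_kdump_config is a hypothetical name, and the path of the .config file is whatever your build tree uses.

```shell
# Hypothetical helper: report which of the options listed above are not set
# to "y" in the given kernel .config file. Returns non-zero if any is missing.
check_kdump_config() {
    config_file=$1
    missing=0
    for opt in CONFIG_KEXEC CONFIG_SYSFS CONFIG_DEBUG_INFO CONFIG_RELOCATABLE \
               CONFIG_RANDOMIZE_BASE CONFIG_CRASH_DUMP CONFIG_PROC_KCORE \
               CONFIG_PROC_VMCORE; do
        # An enabled boolean option appears in .config as exactly "NAME=y".
        if ! grep -q "^${opt}=y\$" "$config_file"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_kdump_config with the path to the .config prints one "missing:" line per absent option and prints nothing when all eight are enabled.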

To trigger a panic through sysrq, the following options must also be enabled.

CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y

2.2. buildroot configuration and build

Enable the kexec tool.

BR2_PACKAGE_KEXEC=y
BR2_PACKAGE_KEXEC_ZLIB=y

3. Running

3.1. Booting the kernel with qemu

The qemu command used to boot the kernel is shown below; adjust it as needed.

qemu-system-aarch64 -nographic -cpu cortex-a57 -M type=virt,mte=off,virtualization=false,gic-version=3 -semihosting -semihosting-config enable=on,target=native -smp 2 -m 1024 -netdev user,id=net0 -device virtio-net-device,netdev=net0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0,max-bytes=1024,period=1000 \
    -kernel /data/eel/images/arm64/kernel/linux-6.6/arm64_debug_defconfig/arch/arm64/boot/Image -drive if=none,format=raw,file=/data/eel/images/arm64/rootfs/buildroot-2023.02/mine_arm64_defconfig/images/rootfs.ext4,id=hd0 -device virtio-blk-device,drive=hd0 \
    -append "console=ttyAMA0 crashkernel=512M-1G:64M,1G-:128M loglevel=6 initcall_debug no_console_suspend root=/dev/vda rw init=/linuxrc" \
    -fsdev local,id=share_comm,path=/data/eel/share,security_model=none -device virtio-9p-device,fsdev=share_comm,mount_tag=mnt_comm -fsdev local,id=share_arch,path=/data/eel/images/arm64,security_model=none -device virtio-9p-device,fsdev=share_arch,mount_tag=mnt_arch -fsdev local,id=share_eel,path=/data/eel,security_model=passthrough,readonly=on -device virtio-9p-device,fsdev=share_eel,mount_tag=mnt_eel

Note that the cmdline must not include nokaslr. It must include crashkernel, which specifies how much memory is reserved for the crash kernel.

  • crashkernel=64M: reserve 64M of memory for the crash kernel.
  • crashkernel=512M-1G:64M,1G-:128M: reserve 64M when total system memory is at least 512M and below 1G, and 128M when it is 1G or more.
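
The range syntax can be worked through by hand. The sketch below is an illustration only, not the kernel's parser: to_mib and crashkernel_reserve are hypothetical helpers, only the M and G suffixes are handled, and the optional @offset part is not supported. Each range is treated as start-inclusive and end-exclusive.

```shell
# Convert a size such as 64M or 1G to MiB (only M and G suffixes are handled).
to_mib() {
    case $1 in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  echo "unsupported size: $1" >&2; return 1 ;;
    esac
}

# Print the reservation a crashkernel= value selects for a RAM size in MiB.
# Prints 0 when no range matches, meaning nothing would be reserved.
crashkernel_reserve() {
    spec=$1
    ram_mib=$2
    case $spec in
        *:*) ;;                          # range syntax, handled below
        *)   echo "$spec"; return 0 ;;   # fixed size such as 64M
    esac
    for entry in $(echo "$spec" | tr ',' ' '); do
        range=${entry%%:*}               # e.g. 512M-1G or 1G-
        size=${entry#*:}                 # e.g. 64M
        start=${range%%-*}
        end=${range#*-}                  # empty for an open-ended range
        start_mib=$(to_mib "$start") || return 1
        if [ "$ram_mib" -lt "$start_mib" ]; then
            continue
        fi
        if [ -n "$end" ]; then
            end_mib=$(to_mib "$end") || return 1
            if [ "$ram_mib" -ge "$end_mib" ]; then
                continue
            fi
        fi
        echo "$size"
        return 0
    done
    echo "0"
}
```

With the -m 1024 used in the qemu command, crashkernel_reserve "512M-1G:64M,1G-:128M" 1024 selects the second range and prints 128M; the RAM size the kernel itself computes may differ slightly from the nominal value.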

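In addition, host directories need to be mapped into qemu so that the guest can read the kernel Image produced by the build and export the generated vmcore to the host. Three directories are mapped here; the first is -fsdev local,id=share_comm,path=/data/eel/share,security_model=none -device virtio-9p-device,fsdev=share_comm,mount_tag=mnt_comm. Inside qemu it can be mounted with mount -t 9p mnt_comm /mnt, where the mount source is the mount_tag value (mnt_comm) rather than the fsdev id.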

3.2. Loading the kernel with kexec

After qemu has booted Linux, use kexec to load the crash kernel. Note that maxcpus=1 and reset_devices are required.

kexec \
    --append="console=ttyAMA0 rootfstype=ext4 rootwait root=/dev/vda rw maxcpus=1 reset_devices" \
    -p /data/eel/images/arm64/kernel/linux-6.6/arm64_debug_defconfig/arch/arm64/boot/Image

If the command prints nothing, it succeeded. Add the -d option to see detailed output.
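
Since a successful load is silent, a more explicit confirmation is the kernel's /sys/kernel/kexec_crash_loaded file, which reads 1 once a crash kernel has been loaded. The function below is a minimal sketch; crash_kernel_loaded is a hypothetical name, and the optional sysfs root argument exists only so the check can be exercised outside a real guest.

```shell
# Hypothetical helper: succeed only if a crash kernel is currently loaded.
# The first argument overrides the sysfs mount point (default: /sys).
crash_kernel_loaded() {
    sysfs_root=${1:-/sys}
    [ "$(cat "$sysfs_root/kernel/kexec_crash_loaded" 2>/dev/null)" = "1" ]
}
```

In the guest, crash_kernel_loaded && echo "kdump armed" gives a quick check before triggering the panic.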

3.3. Triggering a panic

echo c > /proc/sysrq-trigger

The log from the restart shows that the new kernel boots automatically after the panic.

# echo c > /proc/sysrq-trigger
[  374.873622][  T160] sysrq: Trigger a crash
[  374.874185][  T160] Kernel panic - not syncing: sysrq triggered crash
[  374.874602][  T160] CPU: 0 PID: 160 Comm: sh Kdump: loaded Not tainted 6.6.0-g3149bc21d316-dirty #2 1c048aede95eceae4e303b1d563e246e55cb9a9d
[  374.875175][  T160] Hardware name: linux,dummy-virt (DT)
[  374.875392][  T160] Call trace:
[  374.875474][  T160]  dump_backtrace+0x104/0x130
[  374.875674][  T160]  show_stack+0x20/0x50
[  374.875723][  T160]  dump_stack_lvl+0x90/0xbc
[  374.875847][  T160]  dump_stack+0x18/0x34
[  374.875972][  T160]  panic+0x1a0/0x38c
[  374.876039][  T160]  sysrq_handle_crash+0x24/0x2c
[  374.876173][  T160]  __handle_sysrq+0xdc/0x1a4
[  374.876231][  T160]  write_sysrq_trigger+0x78/0xa0
[  374.876418][  T160]  proc_reg_write+0xc8/0x100
[  374.876546][  T160]  vfs_write+0x168/0x330
[  374.876724][  T160]  ksys_write+0x70/0xfc
[  374.876906][  T160]  __arm64_sys_write+0x24/0x30
[  374.877146][  T160]  el0_svc_common.constprop.0+0xfc/0x1f4
[  374.877254][  T160]  do_el0_svc+0xb4/0xc0
[  374.877339][  T160]  el0_svc+0x48/0xa0
[  374.877420][  T160]  el0t_64_sync_handler+0xc8/0x14c
[  374.877642][  T160]  el0t_64_sync+0x19c/0x1a0
[  374.878124][  T160] SMP: stopping secondary CPUs
[  374.878981][  T160] Starting crashdump kernel...
[  374.879230][  T160] Bye!
[    0.000000][    T0] Booting Linux on physical CPU 0x0000000000 [0x411fd070]
[    0.000000][    T0] Linux version 6.6.0-g3149bc21d316-dirty (dix@EEL) (aarch64-linux-gnu-gcc (GCC) 11.3.1 20220604 [releases/gcc-11 revision 591c0f4b92548e3ae2e8173f4f93984b1c7f62bb], GNU ld (Linaro_Binutils-2022.06) 2.37.20220122) #2 SMP PREEMPT Tue Sep 19 01:09:13 CST 2023
[    0.000000][    T0] Machine model: linux,dummy-virt
[    0.000000][    T0] efi: UEFI not found.
[    0.000000][    T0] OF: fdt: Reserving 1 KiB of memory at 0x7ebff000 for elfcorehdr

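The new kernel is the one that was preloaded with kexec; running cat /proc/cpuinfo in it shows only one core.

3.4. Saving the vmcore

After the panic reboot, the dump is available at /proc/vmcore; this file does not exist after a normal boot. Save the vmcore to the directory shared with the host.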

cat /proc/vmcore > /mnt/vmcore_arm64_6.6
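
Because /proc/vmcore only exists in the crash kernel, a guarded copy avoids leaving an empty file behind after a normal boot. The function below is a minimal sketch; save_vmcore is a hypothetical helper, and its optional second argument exists only so it can be exercised against an ordinary file.

```shell
# Hypothetical helper: copy the crash dump to DEST, but only when the kernel
# exposes one. SRC defaults to /proc/vmcore.
save_vmcore() {
    dest=$1
    src=${2:-/proc/vmcore}
    if [ ! -e "$src" ]; then
        echo "no vmcore at $src: this is not a kdump boot" >&2
        return 1
    fi
    # Flush to the shared directory before the guest is shut down.
    cat "$src" > "$dest" && sync
}
```

In the crash kernel this would be called as save_vmcore /mnt/vmcore_arm64_6.6.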

4. Debugging with Crash

Enter the ARM64 Docker container and analyze the dump with the Crash tool.

crash vmcore_arm64_6.6 /data/eel/output/arm64/kernel/linux-6.6/arm64_debug_defconfig/vmlinux

Loading takes a little while; when it finishes, the following information is displayed.

      KERNEL: /data/eel/output/arm64/kernel/linux-6.6/arm64_debug_defconfig/vmlinux
    DUMPFILE: vmcore_arm64_6.6
        CPUS: 2
        DATE: Sat Sep 23 09:47:30 CST 2023
      UPTIME: 00:06:14
LOAD AVERAGE: 0.00, 0.00, 0.00
       TASKS: 60
    NODENAME: buildroot
     RELEASE: 6.6.0-g3149bc21d316-dirty
     VERSION: #2 SMP PREEMPT Tue Sep 19 01:09:13 CST 2023
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 1 GB
       PANIC: "Kernel panic - not syncing: sysrq triggered crash"
         PID: 160
     COMMAND: "sh"
        TASK: ffff65a602b68ec0  [THREAD_INFO: ffff65a602b68ec0]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

Start with a simple bt to look at the call stack at the time of the panic; it shows that the panic was triggered through sysrq.

crash> bt
PID: 160      TASK: ffff65a602b68ec0  CPU: 0    COMMAND: "sh"
 #0 [ffff800008413a10] machine_kexec at ffffd6678e834e98
 #1 [ffff800008413a40] __crash_kexec at ffffd6678e9148bc
 #2 [ffff800008413bd0] panic at ffffd6678f4f6a34
 #3 [ffff800008413cb0] sysrq_handle_crash at ffffd6678eee25f4
 #4 [ffff800008413cc0] __handle_sysrq at ffffd6678eee2e58
 #5 [ffff800008413d10] write_sysrq_trigger at ffffd6678eee34b4
 #6 [ffff800008413d30] proc_reg_write at ffffd6678eb88714
 #7 [ffff800008413d50] vfs_write at ffffd6678eaf3c24
 #8 [ffff800008413df0] ksys_write at ffffd6678eaf3f80
 #9 [ffff800008413e30] __arm64_sys_write at ffffd6678eaf4030
#10 [ffff800008413e40] el0_svc_common.constprop.0 at ffffd6678e829528
#11 [ffff800008413e70] do_el0_svc at ffffd6678e8296d4
#12 [ffff800008413e80] el0_svc at ffffd6678f5085c4
#13 [ffff800008413ea0] el0t_64_sync_handler at ffffd6678f509948
#14 [ffff800008413fe0] el0t_64_sync at ffffd6678e811de4
     PC: 0000ffffb540c450   LR: 0000aaaad6f36968   SP: 0000fffff1d1fe50
    X29: 0000fffff1d1fe50  X28: 0000aaaaf5de4700  X27: 0000000000000000
    X26: 0000aaaaf5de46e0  X25: 0000000000000020  X24: 0000fffff1d1ff20
    X23: 0000000000000001  X22: 0000ffffb55527a0  X21: 0000000000000002
    X20: 0000aaaaf5deb100  X19: 0000000000000001  X18: 0000000000000001
    X17: 0000ffffb540c420  X16: 0000aaaad6ff1948  X15: 65645f34366d7261
    X14: 0000000000000001  X13: 0072656767697274  X12: 2d71727379732f63
    X11: 2f686372612f6769  X10: 0000000000000000   X9: 0000000000000020
     X8: 0000000000000040   X7: 7f7f7f7f7f7f7f7f   X6: 0000000000000063
     X5: fffffffffffffffe   X4: 0000000000000001   X3: 0000ffffb5552010
     X2: 0000000000000002   X1: 0000aaaaf5deb100   X0: 0000000000000001
    ORIG_X0: 0000000000000001  SYSCALLNO: 40  PSTATE: 80000000
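
The register dump can be cross-checked against the backtrace. Assuming SYSCALLNO is printed in hexadecimal, which matches X8 (0000000000000040), the register arm64 uses for the syscall number, the value converts as sketched below; hex_to_dec is a hypothetical helper.

```shell
# Hypothetical helper: convert a hexadecimal value, written without the 0x
# prefix as in the crash register dump, to decimal.
hex_to_dec() {
    echo $(( 0x$1 ))
}
```

hex_to_dec 40 gives 64, which is the write system call in the generic syscall table used by arm64. That agrees with __arm64_sys_write in frame #9 and with the shell writing "c" to /proc/sysrq-trigger.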