1. Setting up an ARM64 Crash debugging environment
Main references:
- Kernel source: Documentation/admin-guide/kdump/kdump.rst, or the online page "Documentation for Kdump - The kexec-based Crash Dumping Solution"
- Crash whitepaper: https://crash-utility.github.io/crash_whitepaper.html
- Crash on GitHub: https://github.com/crash-utility/crash
- Oracle: Using the crash Debugger
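1.1. Building the ARM64 Crash tool on an x86_64 host
A Crash binary installed from packages on an x86 machine cannot debug ARM64 coredump files directly. Rebuilding Crash from source produces a binary that can. Following the build instructions in the README at https://github.com/crash-utility/crash/tree/8.0.3, unpack the source and run make target=ARM64. The relevant part of the README is quoted here: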
To build the crash utility:

  $ tar -xf crash-8.0.3.tar.gz
  $ cd crash-8.0.3
  $ make

The initial build will take several minutes because the embedded gdb module
must be configured and built. Alternatively, the crash source RPM file
may be installed and built, and the resultant crash binary RPM file installed.

The crash binary can only be used on systems of the same architecture as
the host build system. There are a few optional manners of building the
crash binary:

o On an x86_64 host, a 32-bit x86 binary that can be used to analyze
  32-bit x86 dumpfiles may be built by typing "make target=X86".
o On an x86 or x86_64 host, a 32-bit x86 binary that can be used to analyze
  32-bit arm dumpfiles may be built by typing "make target=ARM".
o On an x86 or x86_64 host, a 32-bit x86 binary that can be used to analyze
  32-bit mips dumpfiles may be built by typing "make target=MIPS".
o On an ppc64 host, a 32-bit ppc binary that can be used to analyze
  32-bit ppc dumpfiles may be built by typing "make target=PPC".
o On an x86_64 host, an x86_64 binary that can be used to analyze
  arm64 dumpfiles may be built by typing "make target=ARM64".
o On an x86_64 host, an x86_64 binary that can be used to analyze
  ppc64le dumpfiles may be built by typing "make target=PPC64".
o On an x86_64 host, an x86_64 binary that can be used to analyze
  riscv64 dumpfiles may be built by typing "make target=RISCV64".
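1.2. Installing the ARM64 Crash tool with Docker on an x86_64 host
Building from source means resolving the dependencies yourself, and the build is fairly slow, so Docker is used here instead. The rest of this article uses the Docker approach.
To install Docker, see the Docker CE repository mirror help page on the Tsinghua University Open Source Software Mirror site. Docker can be used as-is afterwards, or configured with a domestic registry mirror by following other guides. With Docker installed, run the following steps.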
# Install ARM64 user-mode emulation support
sudo apt-get install qemu-user-static binfmt-support
# Pull the arm64v8 Debian image
docker pull arm64v8/debian:bookworm-20230814
# Start a container and mount a host directory; Crash needs the vmlinux file when debugging
docker run -it --privileged -v /storage/data:/data arm64v8/debian:bookworm-20230814 bash
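If the ARM64 container fails to start with an exec format error, the binfmt_misc registration is a likely suspect. The following is a minimal sketch of a check, assuming the handler is registered under the entry name qemu-aarch64 (the name used by the qemu-user-static package on Debian and Ubuntu; other distributions may differ). The function name binfmt_enabled is hypothetical.

```shell
# binfmt_enabled [ENTRY_FILE]
# Succeeds if the binfmt_misc entry exists and its first line reads "enabled".
# ENTRY_FILE defaults to the qemu-aarch64 entry (an assumption, see above).
binfmt_enabled() {
    entry=${1:-/proc/sys/fs/binfmt_misc/qemu-aarch64}
    [ -r "$entry" ] && [ "$(head -n 1 "$entry")" = "enabled" ]
}

# Example:
#   binfmt_enabled && echo "ARM64 emulation is registered"
```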
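Once the container is running, install Crash and a few helper tools. A fresh image has no package lists, so refresh them first.
apt update && apt install crash vim less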
After installation, run docker commit [OPTIONS] CONTAINER [REPOSITORY[:TAG]] on the host to save the container as an image. Later containers can then be started from that image.
Inside the container, crash --buildinfo shows which architecture this Crash build supports. Here build_target is ARM64.
crash --buildinfo
   build_command: crash
      build_data: reproducible build
    build_target: ARM64
   build_version: 8.0.2
compiler version: gcc (Debian 12.2.0-9) 12.2.0
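When this check is scripted, the build_target field can be pulled out of the output shown above. This is a minimal sketch that assumes the "name: value" layout of that output; the helper name crash_build_target is hypothetical.

```shell
# Print the build_target value from `crash --buildinfo` output read on stdin.
crash_build_target() {
    awk -F': *' '/^[ \t]*build_target:/ { print $2; exit }'
}

# Example, inside the container:
#   [ "$(crash --buildinfo | crash_build_target)" = "ARM64" ] || echo "wrong target"
```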
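2. Build configuration
This section does not go into the details of the builds; it only lists the configuration they need.
2.1. Kernel configuration and build
The ARM64 kernel must be built with the options below. The documentation does not mention CONFIG_RANDOMIZE_BASE and CONFIG_PROC_KCORE, but they are needed as well, and the cmdline passed to the kernel must not contain nokaslr.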
CONFIG_KEXEC=y
CONFIG_SYSFS=y
CONFIG_DEBUG_INFO=y
CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y
CONFIG_CRASH_DUMP=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
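Before booting, it can be worth confirming that the generated .config really contains these options, since a defconfig or dependency may silently drop one. The following is a minimal sketch; the function name check_kconfig and the example path are hypothetical, and it only recognizes options set to exactly "y".

```shell
# check_kconfig FILE OPTION...
# Prints "missing: OPTION" for every OPTION not set to "y" in FILE.
# Returns 1 if at least one option is missing, 0 otherwise.
check_kconfig() {
    config_file=$1
    shift
    missing=0
    for opt in "$@"; do
        if ! grep -q "^${opt}=y\$" "$config_file"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    return "$missing"
}

# The options listed in this section.
KDUMP_OPTS="CONFIG_KEXEC CONFIG_SYSFS CONFIG_DEBUG_INFO CONFIG_RELOCATABLE
CONFIG_RANDOMIZE_BASE CONFIG_CRASH_DUMP CONFIG_PROC_KCORE CONFIG_PROC_VMCORE"

# Example (hypothetical path):
#   check_kconfig /path/to/build/.config $KDUMP_OPTS
```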
To be able to trigger a panic through sysrq, the following options must be enabled as well.
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
2.2. Buildroot configuration and build
Enable the kexec tool.
BR2_PACKAGE_KEXEC=y
BR2_PACKAGE_KEXEC_ZLIB=y
3. Running
3.1. Booting the kernel with QEMU
The QEMU command used to boot the kernel is shown below; adjust it to your own needs.
qemu-system-aarch64 -nographic -cpu cortex-a57 -M type=virt,mte=off,virtualization=false,gic-version=3 -semihosting -semihosting-config enable=on,target=native -smp 2 -m 1024 -netdev user,id=net0 -device virtio-net-device,netdev=net0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0,max-bytes=1024,period=1000 \
  -kernel /data/eel/images/arm64/kernel/linux-6.6/arm64_debug_defconfig/arch/arm64/boot/Image -drive if=none,format=raw,file=/data/eel/images/arm64/rootfs/buildroot-2023.02/mine_arm64_defconfig/images/rootfs.ext4,id=hd0 -device virtio-blk-device,drive=hd0 \
  -append "console=ttyAMA0 crashkernel=512M-1G:64M,1G-:128M loglevel=6 initcall_debug no_console_suspend root=/dev/vda rw init=/linuxrc" \
  -fsdev local,id=share_comm,path=/data/eel/share,security_model=none -device virtio-9p-device,fsdev=share_comm,mount_tag=mnt_comm -fsdev local,id=share_arch,path=/data/eel/images/arm64,security_model=none -device virtio-9p-device,fsdev=share_arch,mount_tag=mnt_arch -fsdev local,id=share_eel,path=/data/eel,security_model=passthrough,readonly=on -device virtio-9p-device,fsdev=share_eel,mount_tag=mnt_eel
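Note that the cmdline must not contain nokaslr. It must also include a crashkernel parameter, which sets the amount of memory reserved for the crash kernel.
- crashkernel=64M: reserve 64M for the crash kernel.
- crashkernel=512M-1G:64M,1G-:128M: reserve 64M when total system memory is at least 512M and below 1G, and 128M when it is 1G or more.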
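To make the range rule concrete, the selection logic can be mimicked in a few lines of shell. This is only an illustrative sketch and not the kernel's parser: it handles just the M and G suffixes, ignores the optional @offset part, and treats each range as start-inclusive and end-exclusive. The function names to_mb and crashkernel_reserved are hypothetical.

```shell
# to_mb SIZE: convert "64M" or "1G" to MiB (only M and G are handled here).
to_mb() {
    case $1 in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  echo "unsupported size: $1" >&2; return 1 ;;
    esac
}

# crashkernel_reserved SPEC RAM_MB
# SPEC is a plain size ("64M") or a range list ("512M-1G:64M,1G-:128M").
# Prints the reserved size in MiB, or 0 when no range matches.
crashkernel_reserved() {
    spec=$1
    ram_mb=$2
    case $spec in
        *:*) ;;                        # range list, handled below
        *)   to_mb "$spec"; return ;;  # plain size, independent of RAM
    esac
    old_ifs=$IFS
    IFS=,
    set -- $spec                       # split the list on commas
    IFS=$old_ifs
    for entry in "$@"; do
        range=${entry%%:*}
        size=${entry#*:}
        start=$(to_mb "${range%%-*}")
        end=${range#*-}                # empty means no upper limit
        if [ "$ram_mb" -ge "$start" ] &&
           { [ -z "$end" ] || [ "$ram_mb" -lt "$(to_mb "$end")" ]; }; then
            to_mb "$size"
            return
        fi
    done
    echo 0
}

# Example: the QEMU command above uses -m 1024
#   crashkernel_reserved '512M-1G:64M,1G-:128M' 1024
```

With the 1024 MiB guest used in this article, the second range applies, so 128M is reserved.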
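Host directories also need to be mapped into the QEMU guest, so that the guest can read the kernel Image produced by the build and can export the generated vmcore file to the host. Three directories are mapped here. The first is -fsdev local,id=share_comm,path=/data/eel/share,security_model=none -device virtio-9p-device,fsdev=share_comm,mount_tag=mnt_comm. Inside the guest it can be mounted with mount -t 9p mnt_comm /mnt; the mount source is the mount_tag value, not the fsdev id.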
3.2. Loading the crash kernel with kexec
After Linux has booted in QEMU, load the crash kernel with kexec. Note that maxcpus=1 reset_devices are required.
kexec \
  --append="console=ttyAMA0 rootfstype=ext4 rootwait root=/dev/vda rw maxcpus=1 reset_devices" \
  -p /data/eel/images/arm64/kernel/linux-6.6/arm64_debug_defconfig/arch/arm64/boot/Image
If the command prints nothing, it succeeded. Add the -d option to see detailed information.
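Because a successful kexec -p is silent, a quick preflight check before triggering the panic can save a wasted reboot. The sketch below checks the conditions this article relies on: no nokaslr on the cmdline, a crashkernel= parameter present, and a crash kernel loaded according to /sys/kernel/kexec_crash_loaded. The function name kdump_ready is hypothetical, and the two file arguments exist only so the check can be tried against sample files.

```shell
# kdump_ready [CMDLINE_FILE] [LOADED_FILE]
# Prints "ready" and returns 0 when all checks pass; otherwise prints the
# first failed check and returns 1.
kdump_ready() {
    cmdline_file=${1:-/proc/cmdline}
    loaded_file=${2:-/sys/kernel/kexec_crash_loaded}
    cmdline=$(cat "$cmdline_file")
    case " $cmdline " in
        *" nokaslr "*) echo "nokaslr is set"; return 1 ;;
    esac
    case " $cmdline " in
        *" crashkernel="*) ;;
        *) echo "crashkernel= is missing"; return 1 ;;
    esac
    if [ "$(cat "$loaded_file")" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi
    echo "ready"
}

# Example, in the guest after running kexec -p:
#   kdump_ready
```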
3.3. Triggering a panic
echo c > /proc/sysrq-trigger
The log from the restart shows that after the panic the new kernel starts automatically.
# echo c > /proc/sysrq-trigger
[ 374.873622][ T160] sysrq: Trigger a crash
[ 374.874185][ T160] Kernel panic - not syncing: sysrq triggered crash
[ 374.874602][ T160] CPU: 0 PID: 160 Comm: sh Kdump: loaded Not tainted 6.6.0-g3149bc21d316-dirty #2 1c048aede95eceae4e303b1d563e246e55cb9a9d
[ 374.875175][ T160] Hardware name: linux,dummy-virt (DT)
[ 374.875392][ T160] Call trace:
[ 374.875474][ T160] dump_backtrace+0x104/0x130
[ 374.875674][ T160] show_stack+0x20/0x50
[ 374.875723][ T160] dump_stack_lvl+0x90/0xbc
[ 374.875847][ T160] dump_stack+0x18/0x34
[ 374.875972][ T160] panic+0x1a0/0x38c
[ 374.876039][ T160] sysrq_handle_crash+0x24/0x2c
[ 374.876173][ T160] __handle_sysrq+0xdc/0x1a4
[ 374.876231][ T160] write_sysrq_trigger+0x78/0xa0
[ 374.876418][ T160] proc_reg_write+0xc8/0x100
[ 374.876546][ T160] vfs_write+0x168/0x330
[ 374.876724][ T160] ksys_write+0x70/0xfc
[ 374.876906][ T160] __arm64_sys_write+0x24/0x30
[ 374.877146][ T160] el0_svc_common.constprop.0+0xfc/0x1f4
[ 374.877254][ T160] do_el0_svc+0xb4/0xc0
[ 374.877339][ T160] el0_svc+0x48/0xa0
[ 374.877420][ T160] el0t_64_sync_handler+0xc8/0x14c
[ 374.877642][ T160] el0t_64_sync+0x19c/0x1a0
[ 374.878124][ T160] SMP: stopping secondary CPUs
[ 374.878981][ T160] Starting crashdump kernel...
[ 374.879230][ T160] Bye!
[ 0.000000][ T0] Booting Linux on physical CPU 0x0000000000 [0x411fd070]
[ 0.000000][ T0] Linux version 6.6.0-g3149bc21d316-dirty (dix@EEL) (aarch64-linux-gnu-gcc (GCC) 11.3.1 20220604 [releases/gcc-11 revision 591c0f4b92548e3ae2e8173f4f93984b1c7f62bb], GNU ld (Linaro_Binutils-2022.06) 2.37.20220122) #2 SMP PREEMPT Tue Sep 19 01:09:13 CST 2023
[ 0.000000][ T0] Machine model: linux,dummy-virt
[ 0.000000][ T0] efi: UEFI not found.
[ 0.000000][ T0] OF: fdt: Reserving 1 KiB of memory at 0x7ebff000 for elfcorehdr
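The new kernel is the one that was preloaded with kexec. Running cat /proc/cpuinfo in it shows only one CPU, as expected with maxcpus=1.
3.4. Saving the vmcore
After the panic and restart, the dump is available as /proc/vmcore. This file does not exist after a normal boot. Save the vmcore to the directory shared with the host.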
cat /proc/vmcore > /mnt/vmcore_arm64_6.6
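Since /proc/vmcore exists only in the capture kernel, a small guard makes the save step safe to run from a script in either boot. A minimal sketch follows; the function name save_vmcore is hypothetical, and the optional second argument exists only so the function can be tried on an ordinary file.

```shell
# save_vmcore DEST [SRC]
# Copies SRC (default /proc/vmcore) to DEST.
# Returns 1 without creating DEST when SRC does not exist.
save_vmcore() {
    dest=$1
    src=${2:-/proc/vmcore}
    if [ ! -e "$src" ]; then
        echo "no $src: this boot is not a crash capture kernel" >&2
        return 1
    fi
    cp "$src" "$dest"
}

# Example, in the capture kernel with the shared directory mounted on /mnt:
#   save_vmcore /mnt/vmcore_arm64_6.6
```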
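4. Debugging with Crash
Enter the ARM64 Docker container and analyze the dump with Crash.
crash vmcore_arm64_6.6 /data/eel/output/arm64/kernel/linux-6.6/arm64_debug_defconfig/vmlinux
Loading takes a little while. When it finishes, the following information is displayed.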
      KERNEL: /data/eel/output/arm64/kernel/linux-6.6/arm64_debug_defconfig/vmlinux
    DUMPFILE: vmcore_arm64_6.6
        CPUS: 2
        DATE: Sat Sep 23 09:47:30 CST 2023
      UPTIME: 00:06:14
LOAD AVERAGE: 0.00, 0.00, 0.00
       TASKS: 60
    NODENAME: buildroot
     RELEASE: 6.6.0-g3149bc21d316-dirty
     VERSION: #2 SMP PREEMPT Tue Sep 19 01:09:13 CST 2023
     MACHINE: aarch64 (unknown Mhz)
      MEMORY: 1 GB
       PANIC: "Kernel panic - not syncing: sysrq triggered crash"
         PID: 160
     COMMAND: "sh"
        TASK: ffff65a602b68ec0 [THREAD_INFO: ffff65a602b68ec0]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)
As a first step, use bt to look at the call stack at the time of the panic. It shows that the panic was triggered through sysrq.
crash> bt
PID: 160 TASK: ffff65a602b68ec0 CPU: 0 COMMAND: "sh"
 #0 [ffff800008413a10] machine_kexec at ffffd6678e834e98
 #1 [ffff800008413a40] __crash_kexec at ffffd6678e9148bc
 #2 [ffff800008413bd0] panic at ffffd6678f4f6a34
 #3 [ffff800008413cb0] sysrq_handle_crash at ffffd6678eee25f4
 #4 [ffff800008413cc0] __handle_sysrq at ffffd6678eee2e58
 #5 [ffff800008413d10] write_sysrq_trigger at ffffd6678eee34b4
 #6 [ffff800008413d30] proc_reg_write at ffffd6678eb88714
 #7 [ffff800008413d50] vfs_write at ffffd6678eaf3c24
 #8 [ffff800008413df0] ksys_write at ffffd6678eaf3f80
 #9 [ffff800008413e30] __arm64_sys_write at ffffd6678eaf4030
#10 [ffff800008413e40] el0_svc_common.constprop.0 at ffffd6678e829528
#11 [ffff800008413e70] do_el0_svc at ffffd6678e8296d4
#12 [ffff800008413e80] el0_svc at ffffd6678f5085c4
#13 [ffff800008413ea0] el0t_64_sync_handler at ffffd6678f509948
#14 [ffff800008413fe0] el0t_64_sync at ffffd6678e811de4
     PC: 0000ffffb540c450   LR: 0000aaaad6f36968   SP: 0000fffff1d1fe50
    X29: 0000fffff1d1fe50  X28: 0000aaaaf5de4700  X27: 0000000000000000
    X26: 0000aaaaf5de46e0  X25: 0000000000000020  X24: 0000fffff1d1ff20
    X23: 0000000000000001  X22: 0000ffffb55527a0  X21: 0000000000000002
    X20: 0000aaaaf5deb100  X19: 0000000000000001  X18: 0000000000000001
    X17: 0000ffffb540c420  X16: 0000aaaad6ff1948  X15: 65645f34366d7261
    X14: 0000000000000001  X13: 0072656767697274  X12: 2d71727379732f63
    X11: 2f686372612f6769  X10: 0000000000000000   X9: 0000000000000020
     X8: 0000000000000040   X7: 7f7f7f7f7f7f7f7f   X6: 0000000000000063
     X5: fffffffffffffffe   X4: 0000000000000001   X3: 0000ffffb5552010
     X2: 0000000000000002   X1: 0000aaaaf5deb100   X0: 0000000000000001
    ORIG_X0: 0000000000000001  SYSCALLNO: 40  PSTATE: 80000000
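For repeated triage it can be convenient to run a fixed set of commands without typing them at the prompt. Crash accepts a command file through its -i option and suppresses the startup banner with -s. The sketch below only generates such a file; the function name write_crash_cmds, the file name, and the choice of commands (bt, log, ps) are illustrative assumptions, not something the steps above require.

```shell
# write_crash_cmds FILE
# Writes a small crash command script: backtrace, kernel log, process list,
# then exit so that crash does not drop to the interactive prompt.
write_crash_cmds() {
    cat > "$1" <<'EOF'
bt
log
ps
exit
EOF
}

# Example (hypothetical file names; the vmlinux path is the one used above):
#   write_crash_cmds /tmp/triage.crash
#   crash -s -i /tmp/triage.crash vmcore_arm64_6.6 \
#       /data/eel/output/arm64/kernel/linux-6.6/arm64_debug_defconfig/vmlinux > triage.txt
```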