0%

【动态追踪】Perf Events

Perf Event是面向事件的观察工具。简要来说,Perf_Events可以用于解决下面的问题:

  • 为什么内核占用这么多CPU时间?具体是哪一个代码段耗时?
  • 哪部分代码导致CPU会导致L2的缓存失效?
  • 是否CPU被memory I/O所害?
  • 哪段代码在疯狂分配内存?
  • 到底谁导致了TCP的重传?
  • 是否内核中的某一个方法被调用了,有多频繁?

一共有两种Perf_Events:

  • 第一种是直接利用现有提供的Events,其覆盖了大部分可能会用到的事件
  • 如果刚好里面没有想要的,则需要利用第二种方式,自己写新的Perf_Events。

而针对如何如何利用Perf_Events进行测量,一共有三种方法:

  • 直接利用Counting Event计数,可以利用perf工具直接对发生的次数进行计数。这种方法不会生成perf.data文件,直接利用perf stat命令即可。
  • 在指定的时间进行取样,使用这种方法会将Perf_Events数据写到内核缓存里面,然后再由Perf隔一段时间写入perf.data文件中。最后利用perf report或者perf script读取。但是利用这种方法进行采样,report文件的大小overhead比较高(文件很大)。
  • 最后一种方法是利用BPF触发用户自己写的程序,这种方法最灵活。但是同时这种方法也比较复杂,需要自己写触发程序,在后面的eBPF章节中进行描述。

Events

如图所示,Linux Perf Events可以分为以下几类:

  • Hardware Events: CPU 性能检测计数器
  • Software Events: 利用kernel counters低层次的events,例如CPU迁移, minor faults, major faults.
  • Kernel Tracepoint Events: 有些内核态的tracepoint被硬件编码到内核中的一些地方。
  • User Statically-Defined Tracing (USDT): 用户态程序中的静态tracepoint。
  • Dynamic Tracing: 利用kprobe和uprobe在任意位置创建event。
  • Timed Profiling: 可以间隔一段时间时间进行快照。

目前perf支持的事件可以用 perf list 列举出来,如果使用动态追踪,可以追踪其他的事件。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
# perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
ref-cycles [Hardware event]
cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]
minor-faults [Software event]
major-faults [Software event]
alignment-faults [Software event]
emulation-faults [Software event]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
[...]
rNNN [Raw hardware event descriptor]
cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor]
(see 'man perf-list' on how to encode it)
mem:<addr>[:access] [Hardware breakpoint]
probe:tcp_sendmsg [Tracepoint event]
[...]
sched:sched_process_exec [Tracepoint event]
sched:sched_process_fork [Tracepoint event]
sched:sched_process_wait [Tracepoint event]
sched:sched_wait_task [Tracepoint event]
sched:sched_process_exit [Tracepoint event]
[...]
# perf list | wc -l
1580

Software Events

下面是一些 software event

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# perf list

List of pre-defined events (to be used in -e):

alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
[...]

你也可以通过 perf_event_open(2) 的 man page 查看这些事件:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
[...]
PERF_COUNT_SW_CPU_CLOCK
This reports the CPU clock, a high-resolution per-
CPU timer.

PERF_COUNT_SW_TASK_CLOCK
This reports a clock count specific to the task that
is running.

PERF_COUNT_SW_PAGE_FAULTS
This reports the number of page faults.

PERF_COUNT_SW_CONTEXT_SWITCHES
This counts context switches. Until Linux 2.6.34,
these were all reported as user-space events, after
that they are reported as happening in the kernel.

PERF_COUNT_SW_CPU_MIGRATIONS
This reports the number of times the process has
migrated to a new CPU.

PERF_COUNT_SW_PAGE_FAULTS_MIN
This counts the number of minor page faults. These
did not require disk I/O to handle.
[...]

内核也支持 tracepoints,这和 software events 非常像,但是有一个更加灵活的 API。

software event 有一个默认的周期,这意味着当你使用 perf 采样的时候,你获得的是所有事件的一个子集,而不是追踪每一个事件。使用 -vv 参数你可以看到 sample_period 参数,这些字段的含义可以在 perf_event_open(2) 的 man page 中看到。这里的含义是,内核会调整采样的速率使得每秒采样到 4000 次上下文切换。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
$ perf record -vv -e context-switches /bin/true
Using CPUID GenuineIntel-6-5E-3
nr_cblocks: 0
affinity: SYS
mmap flush: 1
comp level: 0
------------------------------------------------------------
perf_event_attr:
type 1
size 120
config 0x3
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|PERIOD
read_format ID
disabled 1
inherit 1
mmap 1
comm 1
freq 1
enable_on_exec 1
task 1
sample_id_all 1
exclude_guest 1
mmap2 1
comm_exec 1
ksymbol 1
bpf_event 1
[...]

如果你想采样到所有事件,可以使用 -c 1 参数。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# perf record -vv -e context-switches -c 1 /bin/true
Using CPUID GenuineIntel-6-55
------------------------------------------------------------
perf_event_attr:
type 1
size 112
config 0x3
{ sample_period, sample_freq } 1
sample_type IP|TID|TIME
disabled 1
inherit 1
mmap 1
comm 1
enable_on_exec 1

Hardware Events

硬件事件是利用处理器的 performance Monitoring Unit(PMU) 实现,读取其中的 Performace Monitoring Counters(PMCs) 或者称为 Performance instrumentation counters(PICs) 。这些Counter可以跟踪一些底层的动作,如CPU cycles,instructions retired, memory stall cycles, level 2 cache misses等等。

PMCs are documented in the Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2 and the BIOS and Kernel Developer’s Guide (BKDG) For AMD Family 10h Processors. There are thousands of different PMCs available.

这些硬件事件的特点是只有其中少数几个事件可以同时被记录。 这时因为硬件资源有限,需要手动指定它们记录哪些event。如何使用Hardware Events
使用硬件Raw Counter的格式是rUUEE,其中UU是umask,EE是event number。而大部分好用的直接加到了perf list中,直接用对应的事件就行。

Kernel Tracepoints

kernel tracepointshard coded 到内核中的追踪点,具体可以参考 我的这篇文章

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# perf list | awk -F: '/Tracepoint event/ { lib[$1]++ } END {
for (l in lib) { printf " %-16.16s %d\n", l, lib[l] } }' | sort | column
alarmtimer 4 i2c 8 page_isolation 1 swiotlb 1
block 19 iommu 7 pagemap 2 syscalls 614
btrfs 51 irq 5 power 22 task 2
cgroup 9 irq_vectors 22 printk 1 thermal 7
clk 14 jbd2 16 random 15 thermal_power_ 2
cma 2 kmem 12 ras 4 timer 13
compaction 14 libata 6 raw_syscalls 2 tlb 1
cpuhp 3 mce 1 rcu 1 udp 1
dma_fence 8 mdio 1 regmap 15 vmscan 15
exceptions 2 migrate 2 regulator 7 vsyscall 1
ext4 95 mmc 2 rpm 4 workqueue 4
fib 3 module 5 sched 24 writeback 30
fib6 1 mpx 5 scsi 5 x86_fpu 14
filelock 10 msr 3 sdt_node 1 xen 35
filemap 2 napi 1 signal 2 xfs 495
ftrace 1 net 10 skb 3 xhci-hcd 9
gpio 2 nmi 1 sock 2
huge_memory 4 oom 1 spi 7

这里面包括的 tracepoints 有:

  • block: block device I/O
  • ext4: file system operations
  • kmem: kernel memory allocation events
  • random: kernel random number generator events
  • sched: CPU scheduler events
  • syscalls: system call enter and exits
  • task: task events

USDT

kernel tracepoints,USDT(User Level Statically Defined Tracing) 也是被 hardcode 到用户态应用程序中,并且有一个稳定的API。很多用户程序已经支持 tracepoints 来支持 DTrace,但是有时候需要用户显式通过 --with-dtrace来编译:

1
2
3
4
5
6
$ sudo apt-get install systemtap-sdt-dev       # adds "dtrace", used by node build
$ wget https://nodejs.org/dist/v4.4.1/node-v4.4.1.tar.gz
$ tar xvf node-v4.4.1.tar.gz
$ cd node-v4.4.1
$ ./configure --with-dtrace
$ make -j 8

编译完成后可以查看可以探测的点:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
$ readelf -n node

Displaying notes found at file offset 0x00000254 with length 0x00000020:
Owner Data size Description
GNU 0x00000010 NT_GNU_ABI_TAG (ABI version tag)
OS: Linux, ABI: 2.6.32

Displaying notes found at file offset 0x00000274 with length 0x00000024:
Owner Data size Description
GNU 0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring)
Build ID: 1e01659b0aecedadf297b2c56c4a2b536ae2308a

Displaying notes found at file offset 0x00e70994 with length 0x000003c4:
Owner Data size Description
stapsdt 0x0000003c NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: gc__start
Location: 0x0000000000dc14e4, Base: 0x000000000112e064, Semaphore: 0x000000000147095c
Arguments: 4@%esi 4@%edx 8@%rdi
stapsdt 0x0000003b NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: gc__done
Location: 0x0000000000dc14f4, Base: 0x000000000112e064, Semaphore: 0x000000000147095e
Arguments: 4@%esi 4@%edx 8@%rdi
stapsdt 0x00000067 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: http__server__response
Location: 0x0000000000dc1894, Base: 0x000000000112e064, Semaphore: 0x0000000001470956
Arguments: 8@%rax 8@-1144(%rbp) -4@-1148(%rbp) -4@-1152(%rbp)
stapsdt 0x00000061 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: net__stream__end
Location: 0x0000000000dc1c44, Base: 0x000000000112e064, Semaphore: 0x0000000001470952
Arguments: 8@%rax 8@-1144(%rbp) -4@-1148(%rbp) -4@-1152(%rbp)
stapsdt 0x00000068 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: net__server__connection
Location: 0x0000000000dc1ff4, Base: 0x000000000112e064, Semaphore: 0x0000000001470950
Arguments: 8@%rax 8@-1144(%rbp) -4@-1148(%rbp) -4@-1152(%rbp)
stapsdt 0x00000060 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: http__client__response
Location: 0x0000000000dc23c5, Base: 0x000000000112e064, Semaphore: 0x000000000147095a
Arguments: 8@%rdx 8@-1144(%rbp) -4@%eax -4@-1152(%rbp)
stapsdt 0x00000089 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: http__client__request
Location: 0x0000000000dc285e, Base: 0x000000000112e064, Semaphore: 0x0000000001470958
Arguments: 8@%rax 8@%rdx 8@-2184(%rbp) -4@-2188(%rbp) 8@-2232(%rbp) 8@-2240(%rbp) -4@-2192(%rbp)
stapsdt 0x00000089 NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: http__server__request
Location: 0x0000000000dc2e69, Base: 0x000000000112e064, Semaphore: 0x0000000001470954
Arguments: 8@%r14 8@%rax 8@-4344(%rbp) -4@-4348(%rbp) 8@-4304(%rbp) 8@-4312(%rbp) -4@-4352(%rbp)

Dynamic Tracing

tracepointsdynamic tracing 的区别如下图所示:

  • tracepoint 是固定添加在某些位置的,往往比较稳定
  • dynamic tracing 可以看到所有的点,因为使用的是 raw code 所以不太稳定

Examples

CPU Statistics

perf stat 命令统计 PMC 相关参数,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# perf stat gzip file1

Performance counter stats for 'gzip file1':

1920.159821 task-clock # 0.991 CPUs utilized
13 context-switches # 0.007 K/sec
0 CPU-migrations # 0.000 K/sec
258 page-faults # 0.134 K/sec
5,649,595,479 cycles # 2.942 GHz [83.43%]
1,808,339,931 stalled-cycles-frontend # 32.01% frontend cycles idle [83.54%]
1,171,884,577 stalled-cycles-backend # 20.74% backend cycles idle [66.77%]
8,625,207,199 instructions # 1.53 insns per cycle
# 0.21 stalled cycles per insn [83.51%]
1,488,797,176 branches # 775.351 M/sec [82.58%]
53,395,139 branch-misses # 3.59% of all branches [83.78%]

1.936842598 seconds time elapsed

这里面包含 IPC,也就是每个时钟周期执行的命令数,这是一个很常用的指标,IPC 值高(超过 1.0) 表明任务执行顺畅。但是也要注意执行的命令是什么,比如在一个 spin loop 里面,IPC 可能很高,但是实际执行的工作却很少。

通过 -d 可以获得 perf stat 更详细的命令输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# perf stat -d gzip file1

Performance counter stats for 'gzip file1':

1610.719530 task-clock # 0.998 CPUs utilized
20 context-switches # 0.012 K/sec
0 CPU-migrations # 0.000 K/sec
258 page-faults # 0.160 K/sec
5,491,605,997 cycles # 3.409 GHz [40.18%]
1,654,551,151 stalled-cycles-frontend # 30.13% frontend cycles idle [40.80%]
1,025,280,350 stalled-cycles-backend # 18.67% backend cycles idle [40.34%]
8,644,643,951 instructions # 1.57 insns per cycle
# 0.19 stalled cycles per insn [50.89%]
1,492,911,665 branches # 926.860 M/sec [50.69%]
53,471,580 branch-misses # 3.58% of all branches [51.21%]
1,938,889,736 L1-dcache-loads # 1203.741 M/sec [49.68%]
154,380,395 L1-dcache-load-misses # 7.96% of all L1-dcache hits [49.66%]
0 LLC-loads # 0.000 K/sec [39.27%]
0 LLC-load-misses # 0.00% of all LL-cache hits [39.61%]

1.614165346 seconds time elapsed

Timed Profiling

per_events 可以通过定期采样 instruction pointer来对 CPU usage 进行 profile,以下命令以 99 Hz 的频率对 CPU Stacks 采样:

1
2
3
4
5
# perf record -F 99 -a -g -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 3.135 MB perf.data (~136971 samples) ]
# ls -lh perf.data
-rw------- 1 root root 3.2M Jan 26 07:26 perf.data

通过 perf report --stdio 命令可以将 perf.data 展示出来:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
# perf report --stdio
# ========
# captured on: Mon Jan 26 07:26:40 2014
# hostname : dev2
# os release : 3.8.6-ubuntu-12-opt
# perf version : 3.8.6
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
# cpuid : GenuineIntel,6,44,2
# total memory : 8182008 kB
# cmdline : /usr/bin/perf record -F 99 -a -g -- sleep 30
# event : name = cpu-clock, type = 1, config = 0x0, config1 = 0x0, config2 = ...
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: software = 1, breakpoint = 5
# ========
#
# Samples: 22K of event 'cpu-clock'
# Event count (approx.): 22751
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ...............................
#
94.12% dd [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
|
--- _raw_spin_unlock_irqrestore
|
|--96.67%-- extract_buf
| extract_entropy_user
| urandom_read
| vfs_read
| sys_read
| system_call_fastpath
| read
|
|--1.69%-- account
| |
| |--99.72%-- extract_entropy_user
| | urandom_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | read
| --0.28%-- [...]
|
|--1.60%-- mix_pool_bytes.constprop.17
[...]

Event Profiling

除了定期的采样,直接从 CPU 硬件计数器采样也是另一种 CPU profiling 的形式,下面列出了可以采样的事件,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# perf list | grep Hardware
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
ref-cycles [Hardware event]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
[...]

这里面的很多事件,如果每次发生都采样将会带来巨大的开销,因此一般只是采样一小部分,可以通过 -c count 命令指定,比如:

1
# perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5

Static Kernel Tracing

下面的例子显示了计算系统调用的次数,可以看到其中主要是 sys_enter_readsys_enter_write 的系统调用:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# perf stat -e 'syscalls:sys_enter_*' gzip file1 2>&1 | awk '$1 != 0'

Performance counter stats for 'gzip file1':

1 syscalls:sys_enter_utimensat
1 syscalls:sys_enter_unlink
5 syscalls:sys_enter_newfstat
1,603 syscalls:sys_enter_read
3,201 syscalls:sys_enter_write
5 syscalls:sys_enter_access
1 syscalls:sys_enter_fchmod
1 syscalls:sys_enter_fchown
6 syscalls:sys_enter_open
9 syscalls:sys_enter_close
8 syscalls:sys_enter_mprotect
1 syscalls:sys_enter_brk
1 syscalls:sys_enter_munmap
1 syscalls:sys_enter_set_robust_list
1 syscalls:sys_enter_futex
1 syscalls:sys_enter_getrlimit
5 syscalls:sys_enter_rt_sigprocmask
14 syscalls:sys_enter_rt_sigaction
1 syscalls:sys_enter_exit_group
1 syscalls:sys_enter_set_tid_address
14 syscalls:sys_enter_mmap

1.543990940 seconds time elapsed

另一种方法是使用 strace -c ,但是对比 perfstraceperf 带来的性能开销要小得多:

1
2
3
4
5
6
7
8
# dd if=/dev/zero of=/dev/null bs=512 count=10000k
5242880000 bytes (5.2 GB) copied, 3.53031 s, 1.5 GB/s

# perf stat -e 'syscalls:sys_enter_*' dd if=/dev/zero of=/dev/null bs=512 count=10000k
5242880000 bytes (5.2 GB) copied, 9.14225 s, 573 MB/s

# strace -c dd if=/dev/zero of=/dev/null bs=512 count=10000k
5242880000 bytes (5.2 GB) copied, 218.915 s, 23.9 MB/s

Static User Tracing

这里是追踪 Node.js 的 USDT Probe 的例子:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# perf buildid-cache --add `which node`
# perf list | grep sdt_node
sdt_node:gc__done [SDT event]
sdt_node:gc__start [SDT event]
sdt_node:http__client__request [SDT event]
sdt_node:http__client__response [SDT event]
sdt_node:http__server__request [SDT event]
sdt_node:http__server__response [SDT event]
sdt_node:net__server__connection [SDT event]
sdt_node:net__stream__end [SDT event]
# perf record -e sdt_node:http__server__request -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.446 MB perf.data (3 samples) ]
# perf script
node 7646 [002] 361.012364: sdt_node:http__server__request: (dc2e69)
node 7646 [002] 361.204718: sdt_node:http__server__request: (dc2e69)
node 7646 [002] 361.363043: sdt_node:http__server__request: (dc2e69)

Dynamic Tracing

通过下列命令添加 tcp_sendmsg 的 tracepoints event:

1
2
3
4
5
6
7
8
# perf probe --add tcp_sendmsg
Failed to find path of kernel module.
Added new event:
probe:tcp_sendmsg (on tcp_sendmsg)

You can now use it in all perf tools, such as:

perf record -e probe:tcp_sendmsg -aR sleep 1

接下来使用 perf record 采样:

1
2
3
# perf record -e probe:tcp_sendmsg -a -g -- sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.228 MB perf.data (~9974 samples) ]

显示报告如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
# perf report --stdio
# ========
# captured on: Fri Jan 31 20:10:14 2014
# hostname : pgbackup
# os release : 3.9.3-ubuntu-12-opt
# perf version : 3.9.3
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
# cpuid : GenuineIntel,6,45,7
# total memory : 8179104 kB
# cmdline : /lib/modules/3.9.3/build/tools/perf/perf record -e probe:tcp_sendmsg -a -g -- sleep 5
# event : name = probe:tcp_sendmsg, type = 2, config = 0x3b2, config1 = 0x0, config2 = 0x0, ...
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: software = 1, tracepoint = 2, breakpoint = 5
# ========
#
# Samples: 12 of event 'probe:tcp_sendmsg'
# Event count (approx.): 12
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ...............
#
100.00% sshd [kernel.kallsyms] [k] tcp_sendmsg
|
--- tcp_sendmsg
sock_aio_write
do_sync_write
vfs_write
sys_write
system_call_fastpath
__write_nocancel
|
|--8.33%-- 0x50f00000001b810
--91.67%-- [...]

Scheduler Analysis

通过 perf sched 命令可以对内核调度器进行分析:

1
2
3
# perf sched record -- sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.886 MB perf.data (13502 samples) ]

通过 perf script 可以查看详细的调度事件,此处暂时不具体分析。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# perf script --header
# ========
# captured on: Sun Feb 26 19:40:00 2017
# hostname : bgregg-xenial
# os release : 4.10-virtual
# perf version : 4.10
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
# cpuid : GenuineIntel,6,62,4
# total memory : 15401700 kB
# cmdline : /usr/bin/perf sched record -- sleep 1
# event : name = sched:sched_switch, , id = { 2752, 2753, 2754, 2755, 2756, 2757, 2758, 2759 }, type = 2, size = 11...
# event : name = sched:sched_stat_wait, , id = { 2760, 2761, 2762, 2763, 2764, 2765, 2766, 2767 }, type = 2, size =...
# event : name = sched:sched_stat_sleep, , id = { 2768, 2769, 2770, 2771, 2772, 2773, 2774, 2775 }, type = 2, size ...
# event : name = sched:sched_stat_iowait, , id = { 2776, 2777, 2778, 2779, 2780, 2781, 2782, 2783 }, type = 2, size...
# event : name = sched:sched_stat_runtime, , id = { 2784, 2785, 2786, 2787, 2788, 2789, 2790, 2791 }, type = 2, siz...
# event : name = sched:sched_process_fork, , id = { 2792, 2793, 2794, 2795, 2796, 2797, 2798, 2799 }, type = 2, siz...
# event : name = sched:sched_wakeup, , id = { 2800, 2801, 2802, 2803, 2804, 2805, 2806, 2807 }, type = 2, size = 11...
# event : name = sched:sched_wakeup_new, , id = { 2808, 2809, 2810, 2811, 2812, 2813, 2814, 2815 }, type = 2, size ...
# event : name = sched:sched_migrate_task, , id = { 2816, 2817, 2818, 2819, 2820, 2821, 2822, 2823 }, type = 2, siz...
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: breakpoint = 5, power = 7, software = 1, tracepoint = 2, msr = 6
# HEADER_CACHE info available, use -I to display
# missing features: HEADER_BRANCH_STACK HEADER_GROUP_DESC HEADER_AUXTRACE HEADER_STAT
# ========
#
perf 16984 [005] 991962.879966: sched:sched_wakeup: comm=perf pid=16999 prio=120 target_cpu=005
[...]

eBPF

使用BPF可以更加灵活地指定或筛选自己需要用到的内容。利用eBPF的方法简要描述如下:

  • /proc/kallsyms中找到想要追踪的地址。
  • 写自己的BPF程序:
1
2
3
4
5
6
7
8
9
10
int func(struct pt_regs *ctx) {
u64 ret = 0;
// x86_64 specific:
probe_read(&ret, sizeof(ret), (void *)(ctx->bp+8));
if (ret >= RANGE_START && ret < RANGE_END) {
perf_event_output(ctx, &channel, get_smp_processor_id(),
&ret, sizeof(ret));
}
return 0;
}
  • 直接用perf指定用自己的程序进行trace

    1
    # perf record -e bpf-output/no-inherit,name=evt/ -e ./kca_from.c/map:channel.event=evt/ -a -- sleep 1

参考资料