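A record of Zabbix exiting abnormally because its configuration cache was too small.

This morning I was woken up by alerts and found that, with my phone on silent overnight, a pile of error logs had built up. One look and Zabbix was gone.

Fault description and analysis

After logging in I found that Zabbix data collection had stalled in the early hours of yesterday, and even data from the local Agent2 was not arriving, so I quickly SSHed into the server.

Since even the local agent2's data was not being collected, I checked the agent2 log and found a large number of messages like these: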

root@zabbix:~# cat /var/log/zabbix/zabbix_agent2.log
...
2025/09/18 08:42:02.920216 Zabbix Agent2 hostname: [Zabbix server]
2025/09/19 21:47:32.001951 [101] cannot receive data from [127.0.0.1:10051]: Cannot read message: 'read tcp 127.0.0.1:47647->127.0.0.1:10051: i/o timeout'
2025/09/19 21:47:32.001981 [101] connection closed
2025/09/19 21:47:32.001985 [101] active check configuration update from host [Zabbix server] started to fail
2025/09/19 21:48:35.002751 [101] cannot receive data from [127.0.0.1:10051]: Cannot read message: 'read tcp 127.0.0.1:34643->127.0.0.1:10051: i/o timeout'
2025/09/19 21:48:35.002780 [101] connection closed
2025/09/19 21:48:35.002785 [101] active check configuration update from host [Zabbix server] started to fail
2025/09/19 21:49:38.002212 [101] cannot receive data from [127.0.0.1:10051]: Cannot read message: 'read tcp 127.0.0.1:55355->127.0.0.1:10051: i/o timeout'
2025/09/19 21:49:38.002249 [101] connection closed
...
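
To pin down when the agent started failing, one option is to pull the timestamp of the first timeout out of the log. This is a minimal sketch: first_failure is a hypothetical helper name, and it assumes the log format shown above (date and time as the first two space-separated fields).

```shell
# Hypothetical helper: print the date and time of the first
# "cannot receive data" line read from stdin.
first_failure() {
    grep -m1 'cannot receive data from' | cut -d' ' -f1,2
}

# Example against the agent log path used in this post:
# first_failure < /var/log/zabbix/zabbix_agent2.log
```

Comparing that timestamp with the last data point in the frontend gives a rough idea of when the server stopped answering.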

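It looked like agent2 could not talk to the server on port 10051, so my first thought was to check whether anything was still listening on 10051 (in other words, whether the zabbix process had died). The result was a shock: a large number of connections on port 10051 stuck in CLOSE_WAIT: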

root@zabbix:~# netstat -anp | grep 10051
...
tcp 144 0 127.0.0.1:10051 127.0.0.1:57825 CLOSE_WAIT -
tcp 144 0 127.0.0.1:10051 127.0.0.1:59825 CLOSE_WAIT -
tcp 118 0 127.0.0.1:10051 127.0.0.1:54727 CLOSE_WAIT -
tcp 144 0 127.0.0.1:10051 127.0.0.1:39359 CLOSE_WAIT -
tcp 144 0 127.0.0.1:10051 127.0.0.1:38389 CLOSE_WAIT -
tcp 0 0 127.0.0.1:34527 127.0.0.1:10051 FIN_WAIT2 -
tcp 118 0 127.0.0.1:10051 127.0.0.1:60731 CLOSE_WAIT -
tcp 118 0 127.0.0.1:10051 127.0.0.1:55711 CLOSE_WAIT -
tcp 118 0 127.0.0.1:10051 127.0.0.1:41455 CLOSE_WAIT -
tcp 144 0 127.0.0.1:10051 127.0.0.1:47245 CLOSE_WAIT -
tcp 118 0 127.0.0.1:10051 127.0.0.1:56507 CLOSE_WAIT -
tcp 118 0 127.0.0.1:10051 127.0.0.1:36807 CLOSE_WAIT -
tcp 144 0 127.0.0.1:10051 127.0.0.1:34833 CLOSE_WAIT -
...

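This shows that when Agent2's active checks fetch their configuration from Zabbix Server, the TCP connection to port 10051 is established, but the server never reads the request or replies (note the non-zero Recv-Q). Agent2 times out and closes its end, and because the server process never closes its own socket, the server-side connection is left in CLOSE_WAIT.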
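
To see at a glance how many connections are stuck in each state, the netstat output can be summarized per TCP state. A small sketch, assuming the column layout shown above (state in the sixth field); count_states is a hypothetical helper name.

```shell
# Hypothetical helper: read netstat-style lines on stdin and print
# "<STATE> <count>" for TCP sockets whose local or remote port is $1.
count_states() {
    awk -v re=":$1\$" '
        $1 ~ /^tcp/ && ($4 ~ re || $5 ~ re) { n[$6]++ }
        END { for (s in n) print s, n[s] }
    ' | sort
}

# Example:
# netstat -anp | count_states 10051
```

A steadily growing CLOSE_WAIT count on the server port is a strong hint that the listening process is alive but no longer servicing its sockets.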

Checking the zabbix-server service status showed most of the worker processes as terminated; practically the whole thing had been wiped out:

root@zabbix:~# systemctl status zabbix-server
● zabbix-server.service - Zabbix Server
Loaded: loaded (/usr/lib/systemd/system/zabbix-server.service; enabled; preset: enabled)
Active: active (running) since Fri 2025-09-19 06:39:12 UTC; 1 day 18h ago
Process: 77363 ExecStart=/usr/sbin/zabbix_server -c $CONFFILE (code=exited, status=0/SUCCESS)
Main PID: 77365 (zabbix_server)
Tasks: 51 (limit: 19071)
Memory: 146.5M (peak: 168.3M)
CPU: 36min 24.962s
CGroup: /system.slice/zabbix-server.service
├─77365 /usr/sbin/zabbix_server -c /etc/zabbix/zabbix_server.conf
├─77366 "/usr/sbin/zabbix_server: ha manager"
├─77367 "/usr/sbin/zabbix_server: service manager #1 [processed 0 events, updated 0 event tags, deleted 0 problems, synced 0 service updates, idle 5.005285 sec during 5.005333 sec]"
├─77381 "/usr/sbin/zabbix_server: alert manager #1 [terminated]"
├─77385 "/usr/sbin/zabbix_server: preprocessing manager #1 [terminating]"
├─77386 "/usr/sbin/zabbix_server: lld manager #1 [terminated]"
├─77389 "/usr/sbin/zabbix_server: housekeeper #1 [terminated]"
├─77390 "/usr/sbin/zabbix_server: timer #1 [terminated]"
├─77391 "/usr/sbin/zabbix_server: http poller #1 [terminated]"
├─77395 "/usr/sbin/zabbix_server: browser poller #1 [terminated]"
├─77410 "/usr/sbin/zabbix_server: history syncer #1 [processed 2 values, 0+0 triggers in 0.000171 (0.000,0.000,0.000,0.000,0.000) sec, syncing history]"
├─77411 "/usr/sbin/zabbix_server: history syncer #2 [processed 0 values, 0+0 triggers in 0.000054 (0.000,0.000,0.000,0.000,0.000) sec, syncing history]"
├─77412 "/usr/sbin/zabbix_server: history syncer #3 [processed 1 values, 1+0 triggers in 0.006948 (0.007,0.000,0.000,0.000,0.000) sec, syncing history]"
├─77416 "/usr/sbin/zabbix_server: history syncer #4 [processed 0 values, 0+0 triggers in 0.000018 (0.000,0.000,0.000,0.000,0.000) sec, syncing history]"
├─77418 "/usr/sbin/zabbix_server: escalator #1 [terminated]"
├─77419 "/usr/sbin/zabbix_server: proxy poller #1 [terminated]"
├─77420 "/usr/sbin/zabbix_server: self-monitoring #1 [terminated]"
├─77421 "/usr/sbin/zabbix_server: task manager #1 [terminated]"
├─77423 "/usr/sbin/zabbix_server: poller #1 [terminated]"
├─77426 "/usr/sbin/zabbix_server: poller #2 [terminated]"
├─77431 "/usr/sbin/zabbix_server: poller #3 [terminated]"
├─77433 "/usr/sbin/zabbix_server: poller #4 [terminated]"
├─77434 "/usr/sbin/zabbix_server: poller #5 [terminated]"
├─77435 "/usr/sbin/zabbix_server: unreachable poller #1 [terminated]"
├─77436 "/usr/sbin/zabbix_server: trapper #1 [terminated]"
├─77437 "/usr/sbin/zabbix_server: trapper #2 [terminated]"
├─77438 "/usr/sbin/zabbix_server: trapper #3 [terminated]"
├─77439 "/usr/sbin/zabbix_server: trapper #4 [terminated]"
├─77441 "/usr/sbin/zabbix_server: trapper #5 [terminated]"
├─77442 "/usr/sbin/zabbix_server: icmp pinger #1 [pinging hosts]"

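So I set about restarting zabbix-server, but the stop hung for 5 minutes without finishing cleanly, so I rebooted the virtual machine. (In hindsight this was the wrong way to handle it: a reboot essentially just runs systemctl stop for each service during shutdown. The right approach is to use kill on the stuck processes.) After the reboot, the service looked like this: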

root@zabbix:~# systemctl status zabbix-server
● zabbix-server.service - Zabbix Server
Loaded: loaded (/usr/lib/systemd/system/zabbix-server.service; enabled; preset: enabled)
Active: activating (auto-restart) (Result: exit-code) since Sun 2025-09-21 01:02:32 UTC; 4s ago
Process: 1136 ExecStart=/usr/sbin/zabbix_server -c $CONFFILE (code=exited, status=0/SUCCESS)
Process: 1143 ExecStop=/bin/sh -c [ -n "$1" ] && kill -s TERM "$1" -- $MAINPID (code=exited, status=1/FAILURE)
Main PID: 1138 (code=exited, status=0/SUCCESS)
CPU: 255ms
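
As noted earlier, a hung stop is better handled by signalling the process directly than by rebooting. Below is a rough sketch of "ask politely, then force". term_then_kill is a hypothetical helper, and the 10-second default grace period is my own assumption rather than a Zabbix recommendation.

```shell
# Hypothetical helper: send SIGTERM to PID $1, wait up to $2 seconds
# (default 10), then send SIGKILL if the process is still there.
term_then_kill() {
    pid=$1
    timeout=${2:-10}
    # If the process is already gone there is nothing to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    i=0
    while kill -0 "$pid" 2>/dev/null && [ "$i" -lt "$timeout" ]; do
        sleep 1
        i=$((i + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
    fi
}

# Example (run as root); systemctl show prints the unit's main PID:
# term_then_kill "$(systemctl show -p MainPID --value zabbix-server)" 30
```

This only signals one PID. To hit every process in the unit, systemctl kill -s KILL zabbix-server should do the same job in one step. Either way, SIGKILL gives the history syncers no chance to flush, so values still sitting in the history cache are lost.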

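After the reboot the zabbix-server service still would not start, so something was wrong with either the zabbix or the mysql configuration. At this point I was thinking the worst case might be that the MySQL database was gone and I would have to restore from backup.

Fixing the startup failure

Given the situation, the obvious first step in treating the symptoms was to look at the zabbix log: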

root@zabbix:/var/log/zabbix# tail -f zabbix_server.log 
1171:20250921:010307.392 3: /usr/sbin/zabbix_server: configuration syncer [syncing configuration](main+0x3ab) [0x5f346181e09b]
1171:20250921:010307.392 2: /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7aa21322a1ca]
1171:20250921:010307.392 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7aa21322a28b]
1171:20250921:010307.392 0: /usr/sbin/zabbix_server: configuration syncer [syncing configuration](_start+0x25) [0x5f3461825235]
1171:20250921:010307.392 [file:dbconfig.c,line:247] __zbx_shmem_malloc(): out of memory (requested 32 bytes)
1171:20250921:010307.392 [file:dbconfig.c,line:247] __zbx_shmem_malloc(): please increase CacheSize configuration parameter
1168:20250921:010307.396 One child process died (PID:1171,exitcode/signal:1). Exiting ...
1169:20250921:010307.397 HA manager has been paused
1169:20250921:010308.920 HA manager has been stopped
1168:20250921:010308.921 Zabbix Server stopped. Zabbix 7.4.2 (revision 7aa4e0782fe).
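
When the log is long, the hint can be pulled out directly instead of scrolling through backtraces. A minimal sketch that relies only on the message wording shown above; suggested_param is a hypothetical helper name.

```shell
# Hypothetical helper: read a Zabbix server log on stdin and print each
# parameter name mentioned in a "please increase ..." hint, once.
suggested_param() {
    sed -n 's/.*please increase \([A-Za-z]*\) configuration parameter.*/\1/p' | sort -u
}

# Example:
# suggested_param < /var/log/zabbix/zabbix_server.log
```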

One line stands out: out of memory (requested 32 bytes). It seemed odd for the server to report a lack of memory, so I checked memory usage:

root@zabbix:/var/log/zabbix# free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       1.1Gi        13Gi        16Mi       925Mi        14Gi
Swap:         4.0Gi          0B       4.0Gi

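System memory was plentiful, so the "out of memory" was not about RAM. Then I noticed the other line: please increase CacheSize configuration parameter. A quick search showed that CacheSize is the shared memory region Zabbix uses to cache the current configuration of all hosts, items and triggers. It defaults to 32M, and when it is too small Zabbix fails to allocate from it and exits.

At this point the problem was clear: raise CacheSize.

Open /etc/zabbix/zabbix_server.conf and find the following lines: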

### Option: CacheSize
# Size of configuration cache, in bytes.
# Shared memory size for storing host, item and trigger data.
#
# Mandatory: no
# Range: 128K-64G
# Default:
# CacheSize=32M

After uncommenting that line I raised CacheSize to 4G in one go, then started zabbix-server, and everything returned to normal.

systemctl daemon-reload && systemctl start zabbix-server
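
For anyone who prefers to script the edit rather than open an editor, here is a small sketch. set_conf is a hypothetical helper, it assumes GNU sed (for -i and -E), and it assumes the parameter appears either uncommented or in the commented-out form used by the stock config file.

```shell
# Hypothetical helper: set KEY=VALUE in a Zabbix-style config file.
# Rewrites an existing "KEY=..." or "# KEY=..." line, otherwise appends.
set_conf() {
    file=$1
    key=$2
    value=$3
    if grep -Eq "^#?[[:space:]]*${key}=" "$file"; then
        sed -i -E "s|^#?[[:space:]]*${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example (back up the file first):
# cp /etc/zabbix/zabbix_server.conf /etc/zabbix/zabbix_server.conf.bak
# set_conf /etc/zabbix/zabbix_server.conf CacheSize 4G
```

Because the pattern is anchored at the start of the line, setting CacheSize does not touch HistoryCacheSize or the other *CacheSize parameters.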

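Other checks after the fault

MySQL

As mentioned, MySQL was my first suspect, although it turned out to have nothing to do with the problem. Still, a routine check of MySQL is good practice:

First, the process list: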

mysql> SHOW PROCESSLIST;
+-----+-----------------+-----------+--------+---------+------+------------------------+------------------+
| Id  | User            | Host      | db     | Command | Time | State                  | Info             |
+-----+-----------------+-----------+--------+---------+------+------------------------+------------------+
|   5 | event_scheduler | localhost | NULL   | Daemon  |  861 | Waiting on empty queue | NULL             |
| 230 | zabbix          | localhost | zabbix | Sleep   |    1 |                        | NULL             |
...
| 261 | zabbix          | localhost | zabbix | Sleep   |  579 |                        | NULL             |
| 493 | root            | localhost | NULL   | Query   |    0 | init                   | SHOW PROCESSLIST |
+-----+-----------------+-----------+--------+---------+------+------------------------+------------------+
33 rows in set, 1 warning (0.00 sec)

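Most of these are sleeping zabbix connections. Zabbix Server keeps its database connections open for a long time for performance instead of constantly creating and destroying them, and MySQL's wait_timeout defaults to 28800 seconds (8 hours), so a connection can sit in Sleep for a long time without being dropped.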
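
To keep an eye on how many idle connections each account holds, the process list can be summarized from the command line. A sketch only: sleeping_by_user is a hypothetical helper, and it expects the tab-separated output that the mysql client produces in batch mode (-B) without column names (-N).

```shell
# Hypothetical helper: read tab-separated SHOW PROCESSLIST rows on stdin
# and print "<user> <count>" for connections whose Command is Sleep.
sleeping_by_user() {
    awk -F'\t' '$5 == "Sleep" { n[$2]++ } END { for (u in n) print u, n[u] }' | sort
}

# Example:
# mysql -N -B -e 'SHOW PROCESSLIST' | sleeping_by_user
```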

Next, the active threads:

mysql> SHOW GLOBAL STATUS LIKE 'Threads_running';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| Threads_running | 2     |
+-----------------+-------+
1 row in set (0.01 sec)

Only two threads are actually running, which is healthy: there is no backlog of active threads.

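Other tuning and retrospective

MySQL

Overall there was not much to optimize in MySQL. Based on experience I applied the configuration changes below.

My MySQL was installed directly from the deb package downloaded from the official site.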

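[mysqld]
# Suggested: 40-60% of physical memory
innodb_buffer_pool_size = 8G
# Roughly one instance per 1G-2G of buffer pool
innodb_buffer_pool_instances = 8
# Can be increased when large transactions put pressure on writes
# (on MySQL 8.0.30 and later these two are deprecated in favor of innodb_redo_log_capacity)
innodb_log_file_size = 1G
innodb_log_files_in_group = 2
# Trade-off between write performance and durability; 2 is common for Zabbix
innodb_flush_log_at_trx_commit = 2
# Avoid double buffering to improve IO performance
innodb_flush_method = O_DIRECT
# Set according to disk performance (can be higher on SSD)
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
# Suggested: about twice the peak number of Zabbix connections
max_connections = 1500
thread_cache_size = 100
# Raising the next two reduces the overhead of opening and closing tables
table_open_cache = 4096
open_files_limit = 65535
# Query and write tuning (the *_buffer_size values are allocated per session)
tmp_table_size = 512M
max_heap_table_size = 512M
join_buffer_size = 4M
sort_buffer_size = 4M
read_rnd_buffer_size = 4M
# Enable the slow query log for later analysis
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 1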
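
A quick arithmetic check on the values above: join_buffer_size, sort_buffer_size and read_rnd_buffer_size are allocated per session (and only when a query needs them), so the theoretical ceiling grows with max_connections. The sketch below is a crude upper bound under that assumption, not a prediction of real usage, and worst_case_mb is a hypothetical helper name.

```shell
# Hypothetical helper: buffer pool (MB) + per-session buffers (MB) * max connections.
worst_case_mb() {
    pool_mb=$1
    per_conn_mb=$2
    max_conn=$3
    echo $(( pool_mb + per_conn_mb * max_conn ))
}

# 8G buffer pool, 4M + 4M + 4M per session, 1500 connections:
worst_case_mb 8192 $((4 + 4 + 4)) 1500   # prints 26192
```

About 26 GB is well above the 15Gi on this host. In practice the 30-odd mostly sleeping connections seen earlier come nowhere near that, but it is a reason not to raise the per-session buffers much further while max_connections stays at 1500.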

Zabbix

Along with CacheSize, the following parameters can also be raised to improve performance for the other kinds of cached data:

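HistoryCacheSize=2G
TrendCacheSize=1G
ValueCacheSize=128M

My history data is fairly large, so I raised HistoryCacheSize substantially. Note that zabbix_server.conf documents a range of 128K-2G for HistoryCacheSize, so 2G is the largest value the server accepts.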
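
Before restarting with new cache sizes, it is worth checking each value against the Range comment printed above the option in zabbix_server.conf, since an out-of-range value stops the server from starting. A minimal sketch: to_bytes and in_range are hypothetical helpers, only the K, M and G suffixes are handled, and 64-bit shell arithmetic is assumed.

```shell
# Hypothetical helper: convert a size such as 128K, 32M or 4G to bytes.
to_bytes() {
    case $1 in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# Hypothetical helper: succeed if VALUE lies within [MIN, MAX].
in_range() {
    v=$(to_bytes "$1")
    lo=$(to_bytes "$2")
    hi=$(to_bytes "$3")
    [ "$v" -ge "$lo" ] && [ "$v" -le "$hi" ]
}

# CacheSize range as shown in the config excerpt earlier in this post:
in_range 4G 128K 64G && echo "CacheSize=4G is within range"
# Range I recall for HistoryCacheSize; confirm it in your own config file:
in_range 4G 128K 2G || echo "4G is out of range for a 128K-2G parameter"
```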

Afterword

I can't be bothered to write any more, so here's a picture instead.

Copyright 2024 LingXuanNing, All rights reserved.