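A record of Zabbix exiting abnormally because its configuration cache was too small.
I was woken this morning by alerts and found that, with my phone on silent overnight, a pile of error notifications had built up. One look confirmed it: Zabbix was down.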
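Fault description and analysis

After logging in, I found that Zabbix data collection had stopped in the early hours of yesterday, and that no data was arriving even from Agent2 on the server itself, so I quickly SSHed into the server.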
Since even the local Agent2 data was missing, I checked the Agent2 log and found many entries like these:

    root@zabbix:~# cat /var/log/zabbix/zabbix_agent2.log
    ...
    2025/09/18 08:42:02.920216 Zabbix Agent2 hostname: [Zabbix server]
    2025/09/19 21:47:32.001951 [101] cannot receive data from [127.0.0.1:10051]: Cannot read message: 'read tcp 127.0.0.1:47647->127.0.0.1:10051: i/o timeout'
    2025/09/19 21:47:32.001981 [101] connection closed
    2025/09/19 21:47:32.001985 [101] active check configuration update from host [Zabbix server] started to fail
    2025/09/19 21:48:35.002751 [101] cannot receive data from [127.0.0.1:10051]: Cannot read message: 'read tcp 127.0.0.1:34643->127.0.0.1:10051: i/o timeout'
    2025/09/19 21:48:35.002780 [101] connection closed
    2025/09/19 21:48:35.002785 [101] active check configuration update from host [Zabbix server] started to fail
    2025/09/19 21:49:38.002212 [101] cannot receive data from [127.0.0.1:10051]: Cannot read message: 'read tcp 127.0.0.1:55355->127.0.0.1:10051: i/o timeout'
    2025/09/19 21:49:38.002249 [101] connection closed
    ...

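So Agent2 was getting no response from port 10051 on the server. My first thought was to check whether anything was still listening on 10051 (that is, whether the Zabbix process had died). The result was alarming: a large number of connections on port 10051 in the CLOSE_WAIT state: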

    root@zabbix:~# netstat -anp | grep 10051
    ...
    tcp      144      0 127.0.0.1:10051         127.0.0.1:57825         CLOSE_WAIT  -
    tcp      144      0 127.0.0.1:10051         127.0.0.1:59825         CLOSE_WAIT  -
    tcp      118      0 127.0.0.1:10051         127.0.0.1:54727         CLOSE_WAIT  -
    tcp      144      0 127.0.0.1:10051         127.0.0.1:39359         CLOSE_WAIT  -
    tcp      144      0 127.0.0.1:10051         127.0.0.1:38389         CLOSE_WAIT  -
    tcp        0      0 127.0.0.1:34527         127.0.0.1:10051         FIN_WAIT2   -
    tcp      118      0 127.0.0.1:10051         127.0.0.1:60731         CLOSE_WAIT  -
    tcp      118      0 127.0.0.1:10051         127.0.0.1:55711         CLOSE_WAIT  -
    tcp      118      0 127.0.0.1:10051         127.0.0.1:41455         CLOSE_WAIT  -
    tcp      144      0 127.0.0.1:10051         127.0.0.1:47245         CLOSE_WAIT  -
    tcp      118      0 127.0.0.1:10051         127.0.0.1:56507         CLOSE_WAIT  -
    tcp      118      0 127.0.0.1:10051         127.0.0.1:36807         CLOSE_WAIT  -
    tcp      144      0 127.0.0.1:10051         127.0.0.1:34833         CLOSE_WAIT  -
    ...

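This shows that when Agent2's active check asked Zabbix Server for its configuration, the TCP connection to 10051 was established but the server never answered. Agent2 timed out and closed its end. The server side never read the request (note the non-zero Recv-Q in the second column) and never closed its own socket, so those sockets were left in CLOSE_WAIT.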
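When the grep output runs to hundreds of lines, a tally per state is easier to read than the raw list. The following is a minimal Python sketch, assuming netstat -an style lines like the ones above; the function name count_states and the sample text are illustrative and not part of Zabbix or any library.

```python
from collections import Counter

def count_states(netstat_text, port):
    """Tally TCP states for sockets whose LOCAL address uses the given port."""
    counts = Counter()
    suffix = ":" + str(port)
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, remote, state, ...
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[3].endswith(suffix):
            counts[fields[5]] += 1
    return counts

# Sample lines copied from the netstat output above (hypothetical subset).
sample = """\
tcp 144 0 127.0.0.1:10051 127.0.0.1:57825 CLOSE_WAIT -
tcp 118 0 127.0.0.1:10051 127.0.0.1:54727 CLOSE_WAIT -
tcp 0 0 127.0.0.1:34527 127.0.0.1:10051 FIN_WAIT2 -
"""
print(count_states(sample, 10051))
```

Filtering on the local address keeps only the server side of each connection, which is where a growing CLOSE_WAIT count points at a process that has stopped servicing its sockets.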
Checking the zabbix-server service status showed that most of the worker processes were terminated; it was close to a total wipeout:

    root@zabbix:~# systemctl status zabbix-server
    ● zabbix-server.service - Zabbix Server
         Loaded: loaded (/usr/lib/systemd/system/zabbix-server.service; enabled; preset: enabled)
         Active: active (running) since Fri 2025-09-19 06:39:12 UTC; 1 day 18h ago
        Process: 77363 ExecStart=/usr/sbin/zabbix_server -c $CONFFILE (code=exited, status=0/SUCCESS)
       Main PID: 77365 (zabbix_server)
          Tasks: 51 (limit: 19071)
         Memory: 146.5M (peak: 168.3M)
            CPU: 36min 24.962s
         CGroup: /system.slice/zabbix-server.service
                 ├─77365 /usr/sbin/zabbix_server -c /etc/zabbix/zabbix_server.conf
                 ├─77366 "/usr/sbin/zabbix_server: ha manager"
                 ├─77367 "/usr/sbin/zabbix_server: service manager #1 [processed 0 events, updated 0 event tags, deleted 0 problems, synced 0 service updates, idle 5.005285 sec during 5.005333 sec]"
                 ├─77381 "/usr/sbin/zabbix_server: alert manager #1 [terminated]"
                 ├─77385 "/usr/sbin/zabbix_server: preprocessing manager #1 [terminating]"
                 ├─77386 "/usr/sbin/zabbix_server: lld manager #1 [terminated]"
                 ├─77389 "/usr/sbin/zabbix_server: housekeeper #1 [terminated]"
                 ├─77390 "/usr/sbin/zabbix_server: timer #1 [terminated]"
                 ├─77391 "/usr/sbin/zabbix_server: http poller #1 [terminated]"
                 ├─77395 "/usr/sbin/zabbix_server: browser poller #1 [terminated]"
                 ├─77410 "/usr/sbin/zabbix_server: history syncer #1 [processed 2 values, 0+0 triggers in 0.000171 (0.000,0.000,0.000,0.000,0.000) sec, syncing history]"
                 ├─77411 "/usr/sbin/zabbix_server: history syncer #2 [processed 0 values, 0+0 triggers in 0.000054 (0.000,0.000,0.000,0.000,0.000) sec, syncing history]"
                 ├─77412 "/usr/sbin/zabbix_server: history syncer #3 [processed 1 values, 1+0 triggers in 0.006948 (0.007,0.000,0.000,0.000,0.000) sec, syncing history]"
                 ├─77416 "/usr/sbin/zabbix_server: history syncer #4 [processed 0 values, 0+0 triggers in 0.000018 (0.000,0.000,0.000,0.000,0.000) sec, syncing history]"
                 ├─77418 "/usr/sbin/zabbix_server: escalator #1 [terminated]"
                 ├─77419 "/usr/sbin/zabbix_server: proxy poller #1 [terminated]"
                 ├─77420 "/usr/sbin/zabbix_server: self-monitoring #1 [terminated]"
                 ├─77421 "/usr/sbin/zabbix_server: task manager #1 [terminated]"
                 ├─77423 "/usr/sbin/zabbix_server: poller #1 [terminated]"
                 ├─77426 "/usr/sbin/zabbix_server: poller #2 [terminated]"
                 ├─77431 "/usr/sbin/zabbix_server: poller #3 [terminated]"
                 ├─77433 "/usr/sbin/zabbix_server: poller #4 [terminated]"
                 ├─77434 "/usr/sbin/zabbix_server: poller #5 [terminated]"
                 ├─77435 "/usr/sbin/zabbix_server: unreachable poller #1 [terminated]"
                 ├─77436 "/usr/sbin/zabbix_server: trapper #1 [terminated]"
                 ├─77437 "/usr/sbin/zabbix_server: trapper #2 [terminated]"
                 ├─77438 "/usr/sbin/zabbix_server: trapper #3 [terminated]"
                 ├─77439 "/usr/sbin/zabbix_server: trapper #4 [terminated]"
                 ├─77441 "/usr/sbin/zabbix_server: trapper #5 [terminated]"
                 ├─77442 "/usr/sbin/zabbix_server: icmp pinger #1 [pinging hosts]"

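So I set about restarting zabbix-server, but the stop hung for 5 minutes without completing, so I rebooted the VM. (This was actually the wrong approach: a reboot essentially runs systemctl stop for each service during shutdown as well. The right way would have been to kill the stuck processes with kill.)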

    root@zabbix:~# systemctl status zabbix-server
    ● zabbix-server.service - Zabbix Server
         Loaded: loaded (/usr/lib/systemd/system/zabbix-server.service; enabled; preset: enabled)
         Active: activating (auto-restart) (Result: exit-code) since Sun 2025-09-21 01:02:32 UTC; 4s ago
        Process: 1136 ExecStart=/usr/sbin/zabbix_server -c $CONFFILE (code=exited, status=0/SUCCESS)
        Process: 1143 ExecStop=/bin/sh -c [ -n "$1" ] && kill -s TERM "$1" -- $MAINPID (code=exited, status=1/FAILURE)
       Main PID: 1138 (code=exited, status=0/SUCCESS)
            CPU: 255ms

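After the reboot, the zabbix-server service still would not start, so something looked wrong with the Zabbix or MySQL configuration. At this point I imagined the worst case: the MySQL database was gone and I would have to restore from backup.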
Fixing the startup failure

Given the situation, the obvious first step was to look at the Zabbix log:

    root@zabbix:/var/log/zabbix# tail -f zabbix_server.log
    1171:20250921:010307.392 3: /usr/sbin/zabbix_server: configuration syncer [syncing configuration](main+0x3ab) [0x5f346181e09b]
    1171:20250921:010307.392 2: /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7aa21322a1ca]
    1171:20250921:010307.392 1: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7aa21322a28b]
    1171:20250921:010307.392 0: /usr/sbin/zabbix_server: configuration syncer [syncing configuration](_start+0x25) [0x5f3461825235]
    1171:20250921:010307.392 [file:dbconfig.c,line:247] __zbx_shmem_malloc(): out of memory (requested 32 bytes)
    1171:20250921:010307.392 [file:dbconfig.c,line:247] __zbx_shmem_malloc(): please increase CacheSize configuration parameter
    1168:20250921:010307.396 One child process died (PID:1171,exitcode/signal:1). Exiting ...
    1169:20250921:010307.397 HA manager has been paused
    1169:20250921:010308.920 HA manager has been stopped
    1168:20250921:010308.921 Zabbix Server stopped. Zabbix 7.4.2 (revision 7aa4e0782fe).

One line stands out: out of memory (requested 32 bytes). It seemed odd for the machine to be out of memory, so I checked memory usage:

    root@zabbix:/var/log/zabbix# free -h
                   total        used        free      shared  buff/cache   available
    Mem:            15Gi       1.1Gi        13Gi        16Mi       925Mi        14Gi
    Swap:          4.0Gi          0B       4.0Gi

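System memory was plentiful, so the allocation that failed was not ordinary RAM. Then I noticed the next line: please increase CacheSize configuration parameter. A quick search showed that CacheSize sets the size of the shared-memory cache that holds the configuration of all monitored items, with a default of 32M. When it is too small, Zabbix fails to allocate from that cache and exits.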
At that point the cause was clear: raise CacheSize.
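The two log lines that gave the answer away are easy to miss below a stack trace. As a minimal Python sketch, assuming the log format shown in the excerpt above, the following pulls out the shared-memory allocation failures and the parameter the server asks to increase. The function name find_cache_hints and both regular expressions are written for this post and are not part of Zabbix.

```python
import re

# Patterns modelled on the two __zbx_shmem_malloc() lines in the excerpt above.
OOM_RE = re.compile(r"__zbx_shmem_malloc\(\): out of memory \(requested (\d+) bytes\)")
HINT_RE = re.compile(r"please increase (\w+) configuration parameter")

def find_cache_hints(log_lines):
    """Return (oom_count, set of parameter names the server asks to increase)."""
    oom_count = 0
    params = set()
    for line in log_lines:
        if OOM_RE.search(line):
            oom_count += 1
        match = HINT_RE.search(line)
        if match:
            params.add(match.group(1))
    return oom_count, params

# Sample lines copied from the zabbix_server.log excerpt above.
sample_log = [
    "1171:20250921:010307.392 [file:dbconfig.c,line:247] __zbx_shmem_malloc(): out of memory (requested 32 bytes)",
    "1171:20250921:010307.392 [file:dbconfig.c,line:247] __zbx_shmem_malloc(): please increase CacheSize configuration parameter",
    "1168:20250921:010307.396 One child process died (PID:1171,exitcode/signal:1). Exiting ...",
]
print(find_cache_hints(sample_log))
```

Capturing the parameter name instead of hard-coding CacheSize means the same scan would also surface hints that name a different cache, if the log contains any.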
Open /etc/zabbix/zabbix_server.conf and find these lines:

    ### Option: CacheSize
    # Size of configuration cache, in bytes.
    # Shared memory size for storing host, item and trigger data.
    #
    # Mandatory: no
    # Range: 128K-64G
    # Default:
    CacheSize=32M

Uncomment that line and raise CacheSize to 4G in one go. After that, starting zabbix-server brought everything back to normal.

    systemctl daemon-reload && systemctl start zabbix-server

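Before restarting, it is cheap to sanity-check a new value against the range quoted in the config comment (128K-64G). The following is a minimal Python sketch; parse_size and in_range are hypothetical helpers, and treating the K, M and G suffixes as powers of 1024 is an assumption here.

```python
# Assumed suffix multipliers (powers of 1024).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

def parse_size(text):
    """Convert strings such as '32M' or '4G' to bytes; plain digits are bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)

def in_range(value, low="128K", high="64G"):
    """Check a value against the range quoted in the CacheSize comment."""
    return parse_size(low) <= parse_size(value) <= parse_size(high)

print(in_range("4G"))                         # is the new value allowed?
print(parse_size("4G") // parse_size("32M"))  # growth factor over the default
```

The second line of output puts the change in perspective: 4G is a large jump from the default, so a smaller value may be enough for installations with fewer items.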
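Other post-incident checks

MySQL

As mentioned, I initially suspected MySQL as well, though it turned out to have nothing to do with the problem. Still, good practice calls for a routine MySQL check: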
First, the process list:

    mysql> SHOW PROCESSLIST;
    +-----+-----------------+-----------+--------+---------+------+------------------------+------------------+
    | Id  | User            | Host      | db     | Command | Time | State                  | Info             |
    +-----+-----------------+-----------+--------+---------+------+------------------------+------------------+
    |   5 | event_scheduler | localhost | NULL   | Daemon  |  861 | Waiting on empty queue | NULL             |
    | 230 | zabbix          | localhost | zabbix | Sleep   |    1 |                        | NULL             |
    ...
    | 261 | zabbix          | localhost | zabbix | Sleep   |  579 |                        | NULL             |
    | 493 | root            | localhost | NULL   | Query   |    0 | init                   | SHOW PROCESSLIST |
    +-----+-----------------+-----------+--------+---------+------+------------------------+------------------+
    33 rows in set, 1 warning (0.00 sec)

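Most of them are sleeping zabbix connections. For performance, Zabbix Server keeps its database connections open for a long time rather than constantly creating and tearing them down. MySQL's default wait_timeout is 28800 seconds (8 hours), so a connection can sleep for a long time without being dropped.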
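To judge at a glance whether the sleepers are anywhere near being dropped, the idle times can be compared with wait_timeout. The following is a minimal Python sketch, assuming the rows have already been fetched as (user, command, seconds) tuples; summarize_sleepers is a hypothetical helper and the sample rows are copied from the output above.

```python
WAIT_TIMEOUT = 28800  # MySQL default wait_timeout in seconds, as noted above

def summarize_sleepers(rows, wait_timeout=WAIT_TIMEOUT):
    """rows: iterable of (user, command, seconds) taken from SHOW PROCESSLIST."""
    sleepers = [(user, secs) for user, command, secs in rows if command == "Sleep"]
    longest = max((secs for _, secs in sleepers), default=0)
    return {
        "sleeping": len(sleepers),
        "longest_idle": longest,
        "seconds_until_timeout": wait_timeout - longest,
    }

# Visible rows from the process list above (the elided rows are not guessed).
rows = [
    ("event_scheduler", "Daemon", 861),
    ("zabbix", "Sleep", 1),
    ("zabbix", "Sleep", 579),
    ("root", "Query", 0),
]
print(summarize_sleepers(rows))
```

With the longest visible idle time at 579 seconds, these connections are nowhere near the 8-hour limit, which matches the reading that they are ordinary pooled connections.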
Next, the active threads:

    mysql> SHOW GLOBAL STATUS LIKE 'Threads_running';
    +-----------------+-------+
    | Variable_name   | Value |
    +-----------------+-------+
    | Threads_running | 2     |
    +-----------------+-------+
    1 row in set (0.01 sec)

Only two threads are actually running, which is healthy: there is no backlog of active threads.
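Other tuning and retrospective

MySQL

Overall there was not much in MySQL to tune. Based on experience, I applied the following configuration changes: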
My MySQL was installed from the deb package downloaded directly from the official site.

    [mysqld]
    innodb_buffer_pool_size = 8G
    innodb_buffer_pool_instances = 8
    innodb_log_file_size = 1G
    innodb_log_files_in_group = 2
    innodb_flush_log_at_trx_commit = 2
    innodb_flush_method = O_DIRECT
    innodb_io_capacity = 2000
    innodb_io_capacity_max = 4000

    max_connections = 1500
    thread_cache_size = 100
    table_open_cache = 4096
    open_files_limit = 65535

    tmp_table_size = 512M
    max_heap_table_size = 512M
    join_buffer_size = 4M
    sort_buffer_size = 4M
    read_rnd_buffer_size = 4M

    slow_query_log = 1
    slow_query_log_file = /var/log/mysql/mysql-slow.log
    long_query_time = 1

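A quick way to sanity-check settings like these against the 15Gi of RAM shown earlier is a rough worst-case estimate: the buffer pool plus the per-session buffers multiplied by max_connections. The following is a minimal Python sketch of that arithmetic. It is a deliberately pessimistic upper bound, not a prediction, because the join, sort and read_rnd buffers are only allocated when a query needs them; worst_case_bytes is a hypothetical helper.

```python
MIB = 1024**2
GIB = 1024**3

buffer_pool = 8 * GIB            # innodb_buffer_pool_size
per_session = (4 + 4 + 4) * MIB  # join + sort + read_rnd buffers, 4M each
max_connections = 1500

def worst_case_bytes(pool, per_conn, connections):
    """Upper bound if every connection used all per-session buffers at once."""
    return pool + per_conn * connections

total = worst_case_bytes(buffer_pool, per_session, max_connections)
print(round(total / GIB, 1))  # worst case in GiB
```

The bound comes out well above the machine's physical memory, so this configuration relies on far fewer than 1500 connections ever being busy at once. The process list above, with 33 connections, suggests that holds here.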
Zabbix

Along with CacheSize, the following parameters can also be raised to improve performance for other kinds of data:

    HistoryCacheSize=4G
    TrendCacheSize=1G
    ValueCacheSize=128M

My history data is also fairly large, so I raised HistoryCacheSize accordingly.
Afterword

Too lazy to write more, so here is a picture instead.