侯体宗的博客
  • 首页
  • Hyperf版
  • beego仿版
  • 人生(杂谈)
  • 技术
  • 关于我
  • 更多分类
    • 文件下载
    • 文字修仙
    • 中国象棋ai
    • 群聊
    • 九宫格抽奖
    • 拼图
    • 消消乐
    • 相册

详细分析Redis集群故障

Redis  /  管理员 发布于 7年前   130

故障表象:

业务层面显示提示查询redis失败

集群组成:

3主3从,每个节点的数据有8GB

机器分布:

在同一个机架中,

xx.x.xxx.199
xx.x.xxx.200
xx.x.xxx.201

redis-server进程状态:

通过命令ps -eo pid,lstart | grep $pid,

发现进程已经持续运行了3个月

发生故障前集群的节点状态:

xx.x.xxx.200:8371(bedab2c537fe94f8c0363ac4ae97d56832316e65) master
xx.x.xxx.199:8373(792020fe66c00ae56e27cd7a048ba6bb2b67adb6) slave
xx.x.xxx.201:8375(5ab4f85306da6d633e4834b4d3327f45af02171b) master
xx.x.xxx.201:8372(826607654f5ec81c3756a4a21f357e644efe605a) slave
xx.x.xxx.199:8370(462cadcb41e635d460425430d318f2fe464665c5) master
xx.x.xxx.200:8374(1238085b578390f3c8efa30824fd9a4baba10ddf) slave

---------------------------------下面是日志分析--------------------------------------

步1:
主节点8371失去和从节点8373的连接:
46590:M 09 Sep 18:57:51.379 # Connection with slave xx.x.xxx.199:8373 lost.

步2:
主节点8370/8375判定8371失联:
42645:M 09 Sep 18:57:50.117 * Marking node bedab2c537fe94f8c0363ac4ae97d56832316e65 as failing (quorum reached).

步3:
从节点8372/8373/8374收到主节点8375说8371失联:
46986:S 09 Sep 18:57:50.120 * FAIL message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0363ac4ae97d56832316e65

步4:
主节点8370/8375授权8373升级为主节点转移:
42645:M 09 Sep 18:57:51.055 # Failover auth granted to 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 for epoch 16

步5:
原主节点8371修改自己的配置,成为8373的从节点:
46590:M 09 Sep 18:57:51.488 # Configuration change detected. Reconfiguring myself as a replica of 792020fe66c00ae56e27cd7a048ba6bb2b67adb6

步6:
主节点8370/8375/8373明确8371失败状态:
42645:M 09 Sep 18:57:51.522 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: master without slots is reachable again.

步7:
新从节点8371开始从新主节点8373,第一次全量同步数据:
8373日志::
4255:M 09 Sep 18:57:51.906 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 18:57:51.906 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 18:57:51.941 * Background saving started by pid 5230
8371日志::
46590:S 09 Sep 18:57:51.948 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440721826993

步8:
主节点8370/8375判定8373(新主)失联:
42645:M 09 Sep 18:58:00.320 * Marking node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 as failing (quorum reached).

步9:
主节点8370/8375判定8373(新主)恢复:
60295:M 09 Sep 18:58:18.181 * Clear FAIL state for node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6: is reachable again and nobody is serving its slots after some time.

步10:
主节点8373完成全量同步所需要的BGSAVE操作:
5230:C 09 Sep 18:59:01.474 * DB saved on disk
5230:C 09 Sep 18:59:01.491 * RDB: 7112 MB of memory used by copy-on-write
4255:M 09 Sep 18:59:01.877 * Background saving terminated with success

步11:
从节点8371开始从主节点8373接收到数据:
46590:S 09 Sep 18:59:02.263 * MASTER <-> SLAVE sync: receiving 2657606930 bytes from master

步12:
主节点8373发现从节点8371对output buffer作了限制:
4255:M 09 Sep 19:00:19.014 # Client id=14259015 addr=xx.x.xxx.200:21772 fd=844 name= age=148 idle=148 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=4103 omem=95944066 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
4255:M 09 Sep 19:00:19.015 # Connection with slave xx.x.xxx.200:8371 lost.

步13:
从节点8371从主节点8373同步数据失败,连接断了,第一次全量同步失败:
46590:S 09 Sep 19:00:19.018 # I/O error trying to sync with MASTER: connection lost
46590:S 09 Sep 19:00:20.102 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:20.102 * MASTER <-> SLAVE sync started

步14:
从节点8371重新开始同步,连接失败,主节点8373的连接数满了:
46590:S 09 Sep 19:00:21.103 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:21.103 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:21.104 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:21.104 # Error reply to PING from master: '-ERR max number of clients reached'

步15:
从节点8371重新连上主节点8373,第二次开始全量同步:
8371日志:
46590:S 09 Sep 19:00:49.175 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:49.175 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:49.175 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:49.176 * Master replied to PING, replication can continue...
46590:S 09 Sep 19:00:49.179 * Partial resynchronization not possible (no cached master)
46590:S 09 Sep 19:00:49.501 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440780763454
8373日志:
4255:M 09 Sep 19:00:49.176 * Slave xx.x.xxx.200:8371 asks for synchronization
4255:M 09 Sep 19:00:49.176 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 19:00:49.176 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 19:00:49.498 * Background saving started by pid 18413
18413:C 09 Sep 19:01:52.466 * DB saved on disk
18413:C 09 Sep 19:01:52.620 * RDB: 2124 MB of memory used by copy-on-write
4255:M 09 Sep 19:01:53.186 * Background saving terminated with success

步16:
从节点8371同步数据成功,开始加载经内存:
46590:S 09 Sep 19:01:53.190 * MASTER <-> SLAVE sync: receiving 2637183250 bytes from master
46590:S 09 Sep 19:04:51.485 * MASTER <-> SLAVE sync: Flushing old data
46590:S 09 Sep 19:05:58.695 * MASTER <-> SLAVE sync: Loading DB in memory

步17:
集群恢复正常:
42645:M 09 Sep 19:05:58.786 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: slave is reachable again.

步18:
从节点8371同步数据成功,耗时7分钟:
46590:S 09 Sep 19:08:19.303 * MASTER <-> SLAVE sync: Finished with success

8371失联原因分析:

由于几台机器在同一个机架,不太可能发生网络中断的情况,于是通过SLOWLOG GET命令查看了慢查询日志,发现有一个KEYS命令被执行了,耗时8.3秒,再查看集群节点超时设置,发现是5s(cluster-node-timeout 5000)

出现节点失联的原因:

客户端执行了耗时1条8.3s的命令,

2016/9/9 18:57:43 开始执行KEYS命令
2016/9/9 18:57:50 8371被判断失联(redis日志)
2016/9/9 18:57:51 执行完KEYS命令

总结来说,有以下几个问题:

1.由于cluster-node-timeout设置比较短,慢查询KEYS导致了集群判断节点8371失联

2.由于8371失联,导致8373升级为主,开始主从同步

3.由于配置client-output-buffer-limit的限制,导致第一次全量同步失败了

4.又由于PHP客户端的连接池有问题,疯狂连接服务器,产生了类似SYN攻击的效果

5.第一次全量同步失败后,从节点重连主节点花了30秒(超过了最大连接数1w)

关于client-output-buffer-limit参数:

# The syntax of every client-output-buffer-limit directive is the following: # # client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds> # # A client is immediately disconnected once the hard limit is reached, or if # the soft limit is reached and remains reached for the specified number of # seconds (continuously). # So for instance if the hard limit is 32 megabytes and the soft limit is # 16 megabytes / 10 seconds, the client will get disconnected immediately # if the size of the output buffers reach 32 megabytes, but will also get # disconnected if the client reaches 16 megabytes and continuously overcomes # the limit for 10 seconds. # # By default normal clients are not limited because they don't receive data # without asking (in a push way), but just after a request, so only # asynchronous clients may create a scenario where data is requested faster # than it can read. # # Instead there is a default limit for pubsub and slave clients, since # subscribers and slaves receive data in a push fashion. # # Both the hard or the soft limit can be disabled by setting them to zero. client-output-buffer-limit normal 0 0 0 client-output-buffer-limit slave 256mb 64mb 60 client-output-buffer-limit pubsub 32mb 8mb 60 

采取措施:

1.单实例的切割到4G以下,否则发生主从切换会耗时很长

2.调整client-output-buffer-limit参数,防止同步进行到一半失败

3.调整cluster-node-timeout,不能少于15s

4.禁止任何耗时超过cluster-node-timeout的慢查询,因为会导致主从切换

5.修复客户端类似SYN攻击的疯狂连接方式

总结

以上就是本文关于详细分析Redis集群故障的全部内容,希望对大家有所帮助。感兴趣的朋友可以参阅:Spring AOP实现Redis缓存数据库查询源码、简述Redis和MySQL的区别、oracle 数据库启动阶段分析等,如有不足之处,请留言之处。小编会及时更正。感谢朋友们对站的支持!


  • 上一条:
    Redis源码解析:集群手动故障转移、从节点迁移详解
    下一条:
    深入解析Redis中常见的应用场景
  • 昵称:

    邮箱:

    0条评论 (评论内容有缓存机制,请悉知!)
    最新最热
    • 分类目录
    • 人生(杂谈)
    • 技术
    • linux
    • Java
    • php
    • 框架(架构)
    • 前端
    • ThinkPHP
    • 数据库
    • 微信(小程序)
    • Laravel
    • Redis
    • Docker
    • Go
    • swoole
    • Windows
    • Python
    • 苹果(mac/ios)
    • 相关文章
    • 在Redis中能实现的功能、常见应用介绍(0个评论)
    • 2024年Redis面试题之一(0个评论)
    • 在redis缓存常见出错及解决方案(0个评论)
    • 在redis中三种特殊数据类型:地理位置、基数(cardinality)估计、位图(Bitmap)使用场景介绍浅析(2个评论)
    • Redis 删除 key用 del 和 unlink 有啥区别?(1个评论)
    • 近期文章
    • 在go语言中使用api.geonames.org接口实现根据国际邮政编码获取地址信息功能(1个评论)
    • 在go语言中使用github.com/signintech/gopdf实现生成pdf分页文件功能(0个评论)
    • gmail发邮件报错:534 5.7.9 Application-specific password required...解决方案(0个评论)
    • 欧盟关于强迫劳动的规定的官方举报渠道及官方举报网站(0个评论)
    • 在go语言中使用github.com/signintech/gopdf实现生成pdf文件功能(0个评论)
    • Laravel从Accel获得5700万美元A轮融资(0个评论)
    • 在go + gin中gorm实现指定搜索/区间搜索分页列表功能接口实例(0个评论)
    • 在go语言中实现IP/CIDR的ip和netmask互转及IP段形式互转及ip是否存在IP/CIDR(0个评论)
    • PHP 8.4 Alpha 1现已发布!(0个评论)
    • Laravel 11.15版本发布 - Eloquent Builder中添加的泛型(0个评论)
    • 近期评论
    • 122 在

      学历:一种延缓就业设计,生活需求下的权衡之选中评论 工作几年后,报名考研了,到现在还没认真学习备考,迷茫中。作为一名北漂互联网打工人..
    • 123 在

      Clash for Windows作者删库跑路了,github已404中评论 按理说只要你在国内,所有的流量进出都在监控范围内,不管你怎么隐藏也没用,想搞你分..
    • 原梓番博客 在

      在Laravel框架中使用模型Model分表最简单的方法中评论 好久好久都没看友情链接申请了,今天刚看,已经添加。..
    • 博主 在

      佛跳墙vpn软件不会用?上不了网?佛跳墙vpn常见问题以及解决办法中评论 @1111老铁这个不行了,可以看看近期评论的其他文章..
    • 1111 在

      佛跳墙vpn软件不会用?上不了网?佛跳墙vpn常见问题以及解决办法中评论 网站不能打开,博主百忙中能否发个APP下载链接,佛跳墙或极光..
    • 2017-12
    • 2020-03
    • 2020-05
    • 2021-04
    • 2022-03
    • 2022-05
    • 2022-08
    • 2023-02
    • 2023-04
    • 2023-07
    • 2024-01
    • 2024-02
    Top

    Copyright·© 2019 侯体宗版权所有· 粤ICP备20027696号 PHP交流群

    侯体宗的博客