MySQL MHA (Master High Availability): The Complete Guide

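1. MHA Overview

1.1 What Is MHA

MHA (Master High Availability) is a high-availability tool for MySQL that automates master failover and slave promotion. When MHA detects that the master has failed, it promotes the slave with the most recent data to be the new master and repoints the remaining slaves at it. When a virtual IP (VIP) is moved as part of the failover, applications need no configuration change, although connections that were open to the failed master are dropped and must be re-established.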

1.2 MHA Architecture

graph TB
Manager["MHA Manager (monitoring node)"] -.monitors.-> Master
Manager -.monitors.-> Slave1
Manager -.monitors.-> Slave2
Master["Master 192.168.1.10"] -->|replication| Slave1["Slave 1 192.168.1.11"]
Master -->|replication| Slave2["Slave 2 192.168.1.12"]
style Manager fill:#ff9999,stroke:#333,stroke-width:2px 
style Master fill:#99ccff,stroke:#333,stroke-width:2px 
style Slave1 fill:#99ff99,stroke:#333,stroke-width:2px 
style Slave2 fill:#ffcc99,stroke:#333,stroke-width:2px

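1.3 MHA Core Components

MHA Manager (management node): monitors all MySQL nodes and performs the failover when the master fails
MHA Node (data node): runs on every MySQL server and handles log recovery and role switching during a failover

1.4 MHA Features

✅ Automatic failure detection and failover (typically completed within 10-30 seconds)
✅ Preserves data consistency as far as possible by applying differential logs
✅ Requires no major changes to an existing replication setup
✅ Supports both manual and automatic failover
✅ Works with any storage engine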

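2. Environment Preparation

2.1 System Requirements

Operating system: Linux (CentOS/RHEL/Ubuntu)
MySQL version: 5.5 or later (5.7 or 8.0 recommended)
Perl version: 5.8 or later
At least 3 servers (1 Manager + 1 Master + 1 Slave); the examples in this guide use 4 (1 Manager + 1 Master + 2 Slaves)

2.2 Example Network Topology

Role	Hostname	IP address	MySQL version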
MHA Manager	mha-manager	192.168.1.100	-
Master	mysql-master	192.168.1.10	8.0.x
Slave 1	mysql-slave1	192.168.1.11	8.0.x
Slave 2	mysql-slave2	192.168.1.12	8.0.x

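2.3 Configuring Passwordless SSH

MHA Manager must be able to log in to every MySQL node over SSH without a password. The MySQL nodes also need passwordless SSH to one another, because masterha_check_ssh tests the connections between all servers; repeat the steps below on each MySQL node for the other nodes.

# Run on the MHA Manager node
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# Copy the public key to every MySQL node
ssh-copy-id root@192.168.1.10
ssh-copy-id root@192.168.1.11
ssh-copy-id root@192.168.1.12

# Test the passwordless login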
ssh root@192.168.1.10 "hostname"

3. Installing MHA

3.1 Installing Dependencies

CentOS/RHEL

# Run on all nodes (Manager and MySQL nodes)
yum install -y epel-release
yum install -y perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch \
    perl-Parallel-ForkManager perl-Time-HiRes perl-DBI

Ubuntu/Debian

# Run on all nodes
apt-get update
apt-get install -y libdbd-mysql-perl libconfig-tiny-perl \
    liblog-dispatch-perl libparallel-forkmanager-perl

3.2 Installing MHA Node

Install MHA Node on every MySQL node:

# Download the MHA Node package
cd /tmp
wget https://github.com/yoshinorim/mha4mysql-node/releases/download/v0.58/mha4mysql-node-0.58.tar.gz
tar -zxf mha4mysql-node-0.58.tar.gz
cd mha4mysql-node-0.58

# Build and install
perl Makefile.PL
make
make install

# Verify the installation
which save_binary_logs
which apply_diff_relay_logs

3.3 Installing MHA Manager

Install on the MHA Manager node only:

# Install MHA Node first (the Manager depends on it)
cd /tmp
wget https://github.com/yoshinorim/mha4mysql-node/releases/download/v0.58/mha4mysql-node-0.58.tar.gz
tar -zxf mha4mysql-node-0.58.tar.gz
cd mha4mysql-node-0.58
perl Makefile.PL && make && make install

# Install MHA Manager
cd /tmp
wget https://github.com/yoshinorim/mha4mysql-manager/releases/download/v0.58/mha4mysql-manager-0.58.tar.gz
tar -zxf mha4mysql-manager-0.58.tar.gz
cd mha4mysql-manager-0.58
perl Makefile.PL
make
make install

# Verify the installation
which masterha_manager
which masterha_check_ssh
which masterha_check_repl

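4. MySQL Replication Setup

4.1 Master Configuration

Edit /etc/my.cnf on the master node:

[mysqld]
# Server ID (must be unique on every server)
server-id = 1

# Enable the binary log
log-bin = mysql-bin
binlog_format = ROW

# Semi-synchronous replication (optional but recommended).
# The loose- prefix lets mysqld start even before the semi-sync plugin
# (installed in section 9.3) is present; without it, an unknown variable
# aborts startup.
loose-rpl_semi_sync_master_enabled = 1
loose-rpl_semi_sync_master_timeout = 1000

# Relay log
relay-log = relay-bin
relay-log-index = relay-bin.index

# GTID mode (recommended)
gtid_mode = ON
enforce_gtid_consistency = ON

# Other important settings
log-slave-updates = 1
read-only = 0
skip-slave-start = 1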

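4.2 Slave Configuration

Edit /etc/my.cnf on each slave node (Slave1 and Slave2):

[mysqld]
# Server ID (different on every server, for example 2 and 3)
server-id = 2  # use 2 on Slave1 and 3 on Slave2

log-bin = mysql-bin
binlog_format = ROW

# Semi-synchronous replication (loose- prefix: see the master configuration)
loose-rpl_semi_sync_slave_enabled = 1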

relay-log = relay-bin
relay-log-index = relay-bin.index

# GTID
gtid_mode = ON
enforce_gtid_consistency = ON

log-slave-updates = 1
read-only = 1
skip-slave-start = 1

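4.3 Creating the Replication and MHA Users

Run on the master:

-- Create the replication user
-- Note: MySQL 8.0 defaults to the caching_sha2_password plugin. If the
-- replication or MHA connections fail to authenticate, creating these users
-- WITH mysql_native_password is a common workaround.
CREATE USER 'repl'@'192.168.1.%' IDENTIFIED BY 'Repl@Pass123';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.168.1.%';

-- Create the MHA administrative user
CREATE USER 'mha'@'192.168.1.%' IDENTIFIED BY 'MHA@Pass123';
GRANT ALL PRIVILEGES ON *.* TO 'mha'@'192.168.1.%';

-- Not required after CREATE USER / GRANT, but harmless
FLUSH PRIVILEGES;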

4.4 Configuring Replication

Run on Slave1 and Slave2:

-- Configure replication using GTID auto-positioning
CHANGE MASTER TO
    MASTER_HOST='192.168.1.10',
    MASTER_USER='repl',
    MASTER_PASSWORD='Repl@Pass123',
    MASTER_AUTO_POSITION=1;

-- Start replication
START SLAVE;

-- Check the replication status
SHOW SLAVE STATUS\G

-- Both of the following must be Yes
-- Slave_IO_Running: Yes
-- Slave_SQL_Running: Yes

4.5 Verifying the Replication Topology

graph TD
M["Master 192.168.1.10 server-id=1"] -->|GTID Replication| S1["Slave 1 192.168.1.11 server-id=2"] 
M -->|GTID Replication| S2["Slave 2 192.168.1.12 server-id=3"] 
style M fill:#99ccff 
style S1 fill:#99ff99 
style S2 fill:#ffcc99

5. Configuring MHA

5.1 Creating the Configuration Directories

On the MHA Manager node:

mkdir -p /etc/mha
mkdir -p /var/log/mha/app1

5.2 Writing the MHA Configuration File

Create /etc/mha/app1.cnf:

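[server default]
# MHA administrative MySQL user
user=mha
password=MHA@Pass123

# SSH user
ssh_user=root
ssh_port=22

# Replication user
repl_user=repl
repl_password=Repl@Pass123

# Working directories
manager_workdir=/var/log/mha/app1
manager_log=/var/log/mha/app1/manager.log
remote_workdir=/var/log/mha/app1

# Scripts run during failover and during an online master switch.
# This guide only provides master_ip_failover (section 5.3); create
# master_ip_online_change before using the online switch, or remove that line.
master_ip_failover_script=/usr/local/bin/master_ip_failover
master_ip_online_change_script=/usr/local/bin/master_ip_online_change

# Optional: script that sends a report after failover
# report_script=/usr/local/bin/send_report

# Script used to power off or fence the failed master (empty = disabled)
shutdown_script=""

# By default MHA will not promote a slave that is far behind in applying
# its relay logs; check_repl_delay=0 disables that check.
# candidate_master is set per server below.
check_repl_delay=0

# Server definitions
[server1]
hostname=192.168.1.10
port=3306
# Candidate master (highest priority; earlier sections win among candidates)
candidate_master=1

[server2]
hostname=192.168.1.11
port=3306
# Candidate master
candidate_master=1

[server3]
hostname=192.168.1.12
port=3306
# Not a candidate master; uncomment no_master=1 to exclude it from promotion entirely
# candidate_master=0
# no_master=1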
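
A duplicated key or a missing hostname in app1.cnf is easy to overlook. The following is a minimal sketch, not part of MHA, of a pre-flight check written in Python. It assumes the file is plain INI as shown above; MHA itself reads the file with Perl's Config::Tiny, so Python's configparser is only a stand-in, and the function name check_mha_config is hypothetical.

```python
import configparser


def check_mha_config(text):
    """Return a list of problems found in MHA-style INI text (empty list = none found)."""
    parser = configparser.ConfigParser(strict=True, interpolation=None)
    try:
        parser.read_string(text)
    except configparser.DuplicateOptionError as exc:
        # strict=True rejects a key that appears twice in one section.
        return [f"duplicate option '{exc.option}' in [{exc.section}]"]

    servers = [s for s in parser.sections() if s != "server default"]
    problems = [] if servers else ["no [serverN] sections defined"]

    seen = set()
    for name in servers:
        host = parser.get(name, "hostname", fallback=None)
        port = parser.get(name, "port", fallback="3306")
        if host is None:
            problems.append(f"[{name}] has no hostname")
        elif (host, port) in seen:
            problems.append(f"[{name}] repeats {host}:{port}")
        seen.add((host, port))

    # A server with no_master=1 can never be promoted.
    eligible = [s for s in servers if parser.get(s, "no_master", fallback="0") != "1"]
    if servers and not eligible:
        problems.append("every server has no_master=1, so no failover target exists")
    return problems


sample = """
[server default]
user=mha
ssh_user=root

[server1]
hostname=192.168.1.10
port=3306
candidate_master=1

[server2]
hostname=192.168.1.11
port=3306
candidate_master=1
"""
print(check_mha_config(sample))
```

This only catches structural slips; masterha_check_ssh and masterha_check_repl in section 6 remain the authoritative checks.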

5.3 Creating the VIP Failover Script

Create /usr/local/bin/master_ip_failover (it moves the VIP during an automatic failover):

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '192.168.1.200/24';
my $key = '1';
my $ssh_start_vip = "/sbin/ip addr add $vip dev eth0 label eth0:$key";
my $ssh_stop_vip = "/sbin/ip addr del $vip dev eth0 label eth0:$key";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {
    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {
        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {
        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}

sub stop_vip() {
    return 0  unless  ($ssh_user);
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

Make the script executable:

chmod +x /usr/local/bin/master_ip_failover

5.4 Configuring the VIP (Virtual IP)

Add the VIP manually on the current master node:

# Add the VIP
ip addr add 192.168.1.200/24 dev eth0 label eth0:1

# Show the VIP
ip addr show eth0

# Test connectivity to the VIP
ping 192.168.1.200

6. Checking and Starting MHA

6.1 Checking SSH Connectivity

Run on the MHA Manager:

masterha_check_ssh --conf=/etc/mha/app1.cnf

Example of successful output:

Wed Feb  4 12:00:00 2026 - [info] All SSH connection tests passed successfully.

6.2 Checking Replication

masterha_check_repl --conf=/etc/mha/app1.cnf

Example of successful output:

MySQL Replication Health is OK.

6.3 Starting MHA Manager

Option 1: Foreground (for testing)

masterha_manager --conf=/etc/mha/app1.cnf

Option 2: Background (production)

nohup masterha_manager --conf=/etc/mha/app1.cnf > /var/log/mha/app1/manager.log 2>&1 &

6.4 Checking MHA Status

# Check the MHA status
masterha_check_status --conf=/etc/mha/app1.cnf

# Follow the log
tail -f /var/log/mha/app1/manager.log
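
For scripted monitoring it can help to turn the status line into structured fields. The sketch below is a Python illustration under one loud assumption: that masterha_check_status prints lines shaped like "app1 (pid:5057) is running(0:PING_OK), master:192.168.1.10" and "app1 is stopped(2:NOT_RUNNING).", which matches the "is stopped" substring the scripts later in this guide test for. Verify the format against your MHA version; parse_status is a hypothetical helper name.

```python
import re

# Assumed line shape: "<app> [(pid:N)] is running|stopped(<code>:<LABEL>)[, master:<host>]"
STATUS_RE = re.compile(
    r"^(?P<app>\S+)(?: \(pid:(?P<pid>\d+)\))? is (?P<state>running|stopped)"
    r"\((?P<code>\d+):(?P<label>[A-Z_]+)\)(?:, master:(?P<master>\S+))?"
)


def parse_status(line):
    """Return a dict of status fields, or None if the line has an unexpected shape."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"]) if info["pid"] else None
    info["code"] = int(info["code"])
    return info


for line in (
    "app1 (pid:5057) is running(0:PING_OK), master:192.168.1.10",
    "app1 is stopped(2:NOT_RUNNING).",
):
    print(parse_status(line))
```

Returning None for an unrecognized line, rather than guessing, lets a caller treat unknown output as an alert condition of its own.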

6.5 MHA Workflow Diagram


sequenceDiagram
    participant Manager as MHA Manager
    participant Master as MySQL Master
    participant Slave1 as MySQL Slave1
    participant Slave2 as MySQL Slave2

    loop Heartbeat monitoring
        Manager->>Master: ping/healthcheck
        Manager->>Slave1: ping/healthcheck
        Manager->>Slave2: ping/healthcheck
    end

    Note over Master: Master down

    Manager->>Master: healthcheck timeout
    alt Master unreachable
        Manager->>Slave1: confirm Master status (2nd check)
        Manager->>Slave2: confirm Master status (2nd check)
        Note over Manager: Master failure confirmed

        Manager->>Slave1: collect relaylog/binlog pos
        Manager->>Slave2: collect relaylog/binlog pos
        Note over Manager: choose the best candidate (e.g. Slave1)

        Manager->>Slave1: apply differential logs
        Manager->>Slave1: promote to new Master
        Manager->>Slave1: switch VIP (or update endpoint)

        Manager->>Slave2: change master to Slave1
        Manager->>Slave2: start replication
        Note over Slave1: New Master
        Note over Slave2: Replicating from new Master
    else Master reachable
        Note over Manager: false alarm, continue monitoring
    end

7. Operating MHA

7.1 Viewing the Current Replication Status

# Check the replication topology
masterha_check_repl --conf=/etc/mha/app1.cnf

# Check whether MHA is running
masterha_check_status --conf=/etc/mha/app1.cnf

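7.2 Manual Switchover (Master Maintenance)

For planned maintenance on the master, use an online switch. Stop the MHA Manager first (masterha_stop); the online switch will not proceed while the Manager is still monitoring the same application.

# Manual online switch (the master is running normally)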
masterha_master_switch \
    --conf=/etc/mha/app1.cnf \
    --master_state=alive \
    --new_master_host=192.168.1.11 \
    --new_master_port=3306 \
    --orig_master_is_new_slave \
    --running_updates_limit=10000 \
    --interactive=0

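# Parameters:
# --master_state=alive : the current master is alive
# --new_master_host : the host to promote
# --orig_master_is_new_slave : the original master becomes a slave after the switch
# --running_updates_limit : abort if the master has write queries running longer than
#                           this many seconds, or if a slave lags by more than this
#                           many seconds (the MHA default is 1)
# --interactive=0 : non-interactive mode

7.3 Manual Failover (Master Down)

When the master is already down and MHA Manager has not failed over automatically:

# Manual failover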
masterha_master_switch \
    --conf=/etc/mha/app1.cnf \
    --master_state=dead \
    --dead_master_host=192.168.1.10 \
    --dead_master_port=3306 \
    --new_master_host=192.168.1.11 \
    --new_master_port=3306 \
    --ignore_last_failover

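# --master_state=dead : the master is down
# --dead_master_host : IP of the failed master
# --ignore_last_failover : proceed even if a previous failover completed recently
#                          (by default MHA refuses a second failover within 8 hours)

7.4 Stopping MHA Manager

# Graceful stop
masterha_stop --conf=/etc/mha/app1.cnf

# Forced stop
killall -9 masterha_manager

7.5 Restarting MHA Manager

# Stop
masterha_stop --conf=/etc/mha/app1.cnf

# Check the status (should report stopped)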
masterha_check_status --conf=/etc/mha/app1.cnf

# Start
nohup masterha_manager --conf=/etc/mha/app1.cnf > /var/log/mha/app1/manager.log 2>&1 &

# Verify
masterha_check_status --conf=/etc/mha/app1.cnf

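8. The Failover Process in Detail

8.1 Automatic Failover Flow

graph TD
A[MHA Manager detects that the master is unreachable] --> B{Secondary check}
B -->|still unreachable| C[Master failure confirmed]
B -->|reachable| D[Resume monitoring]
C --> E[Stop writes to the failed master]
E --> F[Identify the slave with the latest updates]
F --> G[Apply differential relay logs to the other slaves]
G --> H[Promote the slave with the latest data to new master]
H --> I[Point the other slaves at the new master]
I --> J[Bring up the VIP on the new master]
J --> K[Failover complete]
style A fill:#ff9999
style C fill:#ff9999
style H fill:#99ff99
style K fill:#99ccff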

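8.2 Failover Timeline

The timings below are illustrative; actual durations depend on the ping interval, the volume of unapplied logs, and network conditions.

0-3 s: MHA Manager detects that the master is not responding
3-4 s: the master's state is checked a second time through the slaves
4-5 s: master failure is confirmed and the failover begins
5-10 s: the slaves' binlog positions are compared and the most up-to-date slave is selected
10-20 s: differential logs are applied to bring the slaves up to date
20-25 s: the chosen slave is promoted to master and the VIP is configured
25-30 s: the remaining slaves are repointed at the new master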

8.3 Topology Before and After Failover

graph LR
subgraph "Before failover"
  M1["Master (original) 192.168.1.10 VIP: 192.168.1.200"] --> S1_1["Slave 1 192.168.1.11"]
  M1 --> S1_2["Slave 2 192.168.1.12"]
end
subgraph "After failover"
  M2["Master (new) 192.168.1.11 VIP: 192.168.1.200"] --> S2_1["Slave 2 192.168.1.12"]
  M2 -.awaiting repair.-> S2_2["Original master (stopped) 192.168.1.10"]
end
style M1 fill:#99ccff 
style M2 fill:#99ff99 
style S2_2 fill:#ff9999,stroke-dasharray: 5 5

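9. Data Synchronization and Consistency

9.1 Monitoring Replication Lag

-- Check replication lag on a slave
SHOW SLAVE STATUS\G

-- Key fields:
-- Seconds_Behind_Master: lag in seconds (NULL means replication is not running)
-- Slave_IO_Running: I/O thread state
-- Slave_SQL_Running: SQL thread state
-- Retrieved_Gtid_Set: GTIDs received
-- Executed_Gtid_Set: GTIDs executed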

9.2 Checking GTID Consistency

-- View the GTID set on the master
SHOW MASTER STATUS;

-- View the GTID set on a slave
SHOW SLAVE STATUS\G

-- Compare the Executed_Gtid_Set values on both sides
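
Comparing two long Executed_Gtid_Set strings by eye is error-prone. The following is a minimal Python sketch of the comparison, assuming the classic untagged format uuid:first-last[:first-last...] with entries separated by commas and intervals that do not overlap; parse_gtid_set and missing_gtids are hypothetical helper names, not MySQL or MHA functions.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(first, last), ...]} with inclusive bounds."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        intervals = result.setdefault(uuid.lower(), [])
        for item in ranges:
            first, _, last = item.partition("-")
            intervals.append((int(first), int(last or first)))
    return {uuid: sorted(ivs) for uuid, ivs in result.items()}


def missing_gtids(source, replica):
    """Return the intervals present in `source` but absent from `replica`."""
    src, rep = parse_gtid_set(source), parse_gtid_set(replica)
    missing = {}
    for uuid, intervals in src.items():
        gaps = []
        for first, last in intervals:
            cursor = first  # lowest transaction number not yet accounted for
            for r_first, r_last in rep.get(uuid, []):
                if r_last < cursor or r_first > last:
                    continue  # replica interval does not touch the remaining range
                if r_first > cursor:
                    gaps.append((cursor, r_first - 1))
                cursor = max(cursor, r_last + 1)
            if cursor <= last:
                gaps.append((cursor, last))
        if gaps:
            missing[uuid] = gaps
    return missing


master_set = "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-10"
slave_set = "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-4:7-8"
print(missing_gtids(master_set, slave_set))
```

Calling missing_gtids(slave_set, master_set) with the arguments swapped lists transactions that exist only on the slave, which is one way to spot errant writes made directly on a replica.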

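9.3 Configuring Semi-Synchronous Replication

With semi-synchronous replication the master acknowledges a commit to the client only after at least one slave has confirmed receipt of the transaction's binlog events. If no slave responds within the timeout, the master falls back to asynchronous replication:

-- On the master
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000; -- timeout in milliseconds (1 second)

-- On each slave
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;

-- Check the semi-sync status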
SHOW STATUS LIKE 'Rpl_semi_sync%';

9.4 Data Consistency Verification Tools

# Check data consistency with pt-table-checksum
pt-table-checksum \
    --host=192.168.1.10 \
    --user=mha \
    --password=MHA@Pass123 \
    --databases=mydb \
    --replicate=percona.checksums

# Repair inconsistent data with pt-table-sync
# (run with --print instead of --execute first to review the changes)
pt-table-sync \
    --execute \
    --sync-to-master \
    h=192.168.1.11,u=mha,p=MHA@Pass123

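10. Failure Handling and Recovery

10.1 Recovery After a Master Failure

Scenario 1: Hardware failure, the master cannot be recovered

1. MHA completes the failover automatically
2. Repair or replace the hardware
3. Reinstall MySQL and restore the data from a recent backup taken from the new master
4. Configure the original master as a new slave

# On the repaired original master
# Configure it as a slave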
mysql -e "CHANGE MASTER TO \
    MASTER_HOST='192.168.1.11', \
    MASTER_USER='repl', \
    MASTER_PASSWORD='Repl@Pass123', \
    MASTER_AUTO_POSITION=1;"

mysql -e "START SLAVE;"
mysql -e "SHOW SLAVE STATUS\G"

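Scenario 2: Software failure, the master can be recovered quickly

1. MHA completes the failover
2. Fix the problem on the original master
3. Add the original master back to the cluster as a slave
4. Optional: switch back to the original master in a planned switchover

10.2 Handling Slave Failures

Slave Down

# 1. Check the status of the other slave
mysql -h192.168.1.12 -umha -p -e "SHOW SLAVE STATUS\G"

# 2. Repair the failed slave
# 3. Reconfigure replication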
mysql -h192.168.1.11 -umha -p -e "
CHANGE MASTER TO
    MASTER_HOST='192.168.1.10',
    MASTER_USER='repl',
    MASTER_PASSWORD='Repl@Pass123',
    MASTER_AUTO_POSITION=1;
START SLAVE;"

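Slave Replication Broken

-- Inspect the error
SHOW SLAVE STATUS\G

-- Common fixes
-- 1. Skip the failing event (use with care). This applies to position-based
--    replication; with gtid_mode = ON, as configured in this guide, use method 2.
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;

-- 2. Skip the failing transaction by GTID (inject an empty transaction)
STOP SLAVE;
SET GTID_NEXT='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:N';
BEGIN; COMMIT;
SET GTID_NEXT='AUTOMATIC';
START SLAVE;

-- 3. Rebuild replication (the most thorough option)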
STOP SLAVE;
RESET SLAVE ALL;
CHANGE MASTER TO ... ;
START SLAVE;

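10.3 MHA Manager Failure

A failed MHA Manager does not affect MySQL replication, but automatic failover is unavailable until it is back:

1. Repair the MHA Manager node
2. Check the configuration file /etc/mha/app1.cnf
3. Restart MHA Manager

# Check SSH and replication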
masterha_check_ssh --conf=/etc/mha/app1.cnf
masterha_check_repl --conf=/etc/mha/app1.cnf

# Start MHA Manager
nohup masterha_manager --conf=/etc/mha/app1.cnf > /var/log/mha/app1/manager.log 2>&1 &

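10.4 Handling Split-Brain

Split-brain: a network failure makes MHA Manager wrongly conclude that the master is down, and the resulting failover leaves more than one master accepting writes.

graph TB
subgraph "Split-brain scenario"
  Manager["MHA Manager (network isolated)"] -.X.-> Master["Original master still running 192.168.1.10"]
  Manager --> Slave1["Promoted to new master 192.168.1.11"]
end
style Master fill:#ff9999
style Slave1 fill:#ff9999
style Manager fill:#ffcc99

Prevention:

Configure a secondary check (MHA's secondary_check_script, for example masterha_secondary_check) so that the master's state is confirmed over a second network route, such as through the slaves, before failover
Use a VIP so that only one master serves application traffic
Use firewall rules or fencing so that the old master cannot accept writes once a new master has been promoted

Handling steps:

1. Stop application writes immediately
2. Compare the GTID sets and data of the two masters
3. Choose the master with the most recent data
4. Demote the other master to a slave
5. Merge the divergent data if necessary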

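11. Common Problems and Solutions

11.1 MHA Manager Fails to Start

Symptom

Error: MySQL master is not configured correctly.

Possible causes

Wrong configuration file path
Insufficient privileges for the MySQL user
Passwordless SSH not configured
Replication in an abnormal state

Solution

# 1. Review the configuration file
cat /etc/mha/app1.cnf

# 2. Check SSH connectivity
masterha_check_ssh --conf=/etc/mha/app1.cnf

# 3. Check the replication status
masterha_check_repl --conf=/etc/mha/app1.cnf

# 4. Check the MySQL user's privileges
mysql -h192.168.1.10 -umha -p -e "SHOW GRANTS FOR 'mha'@'192.168.1.%';"

# 5. Read the detailed log
tail -100 /var/log/mha/app1/manager.log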

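11.2 MHA Manager Does Not Restart After a Failover

Explanation

MHA Manager exits once it has completed a failover and must be started again. Before restarting, note that the Manager's startup checks fail while the dead master is still listed in app1.cnf; either remove its [serverN] block or start masterha_manager with --remove_dead_master_conf.

Solution 1: Automatic restart with systemd

Create /etc/systemd/system/mha-manager.service, setting ExecStart to the path reported by `which masterha_manager`: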

[Unit]
Description=MHA Manager
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/masterha_manager --conf=/etc/mha/app1.cnf --wait_on_failover_error=0
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Enable and start the service
systemctl daemon-reload
systemctl enable mha-manager
systemctl start mha-manager
systemctl status mha-manager

Solution 2: A watchdog script

#!/bin/bash
# /usr/local/bin/mha_monitor.sh

while true; do
    status=$(masterha_check_status --conf=/etc/mha/app1.cnf 2>&1)
    if [[ $status == *"is stopped"* ]]; then
        echo "$(date): MHA is stopped, restarting..."
        nohup masterha_manager --conf=/etc/mha/app1.cnf > /var/log/mha/app1/manager.log 2>&1 &
        sleep 10
    fi
    sleep 30
done

11.3 The VIP Did Not Move Automatically

Checks

# 1. Make sure the script exists and is executable
ls -l /usr/local/bin/master_ip_failover
chmod +x /usr/local/bin/master_ip_failover

# 2. Test the script manually
perl /usr/local/bin/master_ip_failover \
    --command=status \
    --ssh_user=root \
    --orig_master_host=192.168.1.10

# 3. Check where the VIP currently lives
ip addr show | grep 192.168.1.200

# 4. Look for the script's execution records in the MHA log
grep "master_ip_failover" /var/log/mha/app1/manager.log

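11.4 Excessive Replication Lag

Diagnosis

-- Check the lag
SHOW SLAVE STATUS\G

-- See which statements are currently running
SHOW PROCESSLIST;

-- Check replication filters: look at the Replicate_* fields of SHOW SLAVE STATUS,
-- or run from the shell: mysql -e "SHOW SLAVE STATUS\G" | grep -i replicate_

Optimization

-- 1. Enable parallel replication (MySQL 5.7+). slave_parallel_type and
--    slave_preserve_commit_order can only be changed while the SQL thread is
--    stopped, so set everything before START SLAVE.
STOP SLAVE;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 4;

-- 2. Tune related replication parameters
SET GLOBAL slave_pending_jobs_size_max = 134217728; -- 128M
SET GLOBAL slave_preserve_commit_order = 1;
START SLAVE;

-- 3. Check that parallel replication is active
SHOW PROCESSLIST;
-- You should see several "system user" threads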

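11.5 GTID Inconsistency

Checking the GTID State

-- Run on every node
SELECT @@GLOBAL.GTID_MODE, @@GLOBAL.GTID_EXECUTED, @@GLOBAL.GTID_PURGED;

Repairing a GTID Inconsistency

-- Method 1: inject an empty transaction (skips the missing GTID)
SET GTID_NEXT='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:N';
BEGIN; COMMIT;
SET GTID_NEXT='AUTOMATIC';

-- Method 2: reset the GTID state on the slave (use with care: RESET MASTER
-- deletes the slave's own binlogs and GTID history). After the reset, set
-- gtid_purged to the GTID set that the slave's data already contains;
-- otherwise the slave fetches and re-applies every transaction from the start.
STOP SLAVE;
RESET MASTER;
SET GLOBAL gtid_purged = '<GTID set already contained in this slave''s data>';
CHANGE MASTER TO MASTER_AUTO_POSITION=1;
START SLAVE;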

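11.6 Disk Full Because of Large binlog Files

Cleanup Strategy

-- List the binlog files
SHOW BINARY LOGS;

-- Automatic cleanup: the next line is a my.cnf setting, not an SQL statement
-- (on MySQL 8.0, binlog_expire_logs_seconds = 604800 is the preferred equivalent)
expire_logs_days = 7

-- Manual cleanup (purge binlogs older than 3 days); first make sure every
-- slave has already received the files being purged
PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 3 DAY);

-- Purge up to a given file (that file and later ones are kept)
PURGE BINARY LOGS TO 'mysql-bin.000100';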

11.7 The Master Detected by MHA Is Not the Real One

Causes

Wrong server order in the configuration file
A slave is replicating from the wrong source

Solution

# 1. Check the replication status of every node
for host in 192.168.1.10 192.168.1.11 192.168.1.12; do
    echo "=== $host ==="
    mysql -h$host -umha -p -e "SHOW SLAVE STATUS\G" | grep -E "Master_Host|Master_Port"
done

# 2. Correct the wrong replication source
mysql -h192.168.1.12 -umha -p -e "
STOP SLAVE;
CHANGE MASTER TO MASTER_HOST='192.168.1.10';
START SLAVE;"

# 3. Check again
masterha_check_repl --conf=/etc/mha/app1.cnf

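12. Monitoring and Alerting

12.1 Metrics

Category	Metric	Alert threshold
Replication status	Slave_IO_Running	!= Yes
Replication status	Slave_SQL_Running	!= Yes
Replication status	Seconds_Behind_Master	> 60 seconds
MHA status	Manager process	process not running
MHA status	Last check time	> 5 minutes ago
System resources	Disk usage	> 80%
System resources	Network connectivity	SSH unreachable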
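
The replication rows of the table can be turned into an automated check. The sketch below is a Python illustration that parses the text produced by `mysql -e "SHOW SLAVE STATUS\G"` and applies the three replication thresholds; the function names and the sample text are hypothetical, and the 60-second limit is simply the value from the table.

```python
def parse_slave_status(output):
    """Turn vertical-format SHOW SLAVE STATUS text into a {field: value} dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # row-separator lines contain no colon and are skipped
            fields[key.strip()] = value.strip()
    return fields


def replication_alerts(fields, max_lag_seconds=60):
    """Return a list of alert messages; an empty list means replication looks healthy."""
    alerts = []
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        if fields.get(thread) != "Yes":
            alerts.append(f"{thread} is {fields.get(thread, 'missing')}")
    lag = fields.get("Seconds_Behind_Master", "NULL")
    if lag == "NULL":
        alerts.append("Seconds_Behind_Master is NULL (replication not running)")
    elif int(lag) > max_lag_seconds:
        alerts.append(f"replication lag {lag}s exceeds {max_lag_seconds}s")
    return alerts


sample_output = """*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
        Seconds_Behind_Master: NULL
"""
for message in replication_alerts(parse_slave_status(sample_output)):
    print(message)
```

Treating a NULL lag as an alert, not as zero, matters because NULL is what a stopped SQL thread reports.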

12.2 Example Monitoring Script

#!/bin/bash
# /usr/local/bin/mha_health_check.sh

CONFIG="/etc/mha/app1.cnf"
LOG="/var/log/mha/health_check.log"

echo "$(date): Starting health check" >> $LOG

# 1. Check the MHA Manager status
status=$(masterha_check_status --conf=$CONFIG 2>&1)
if [[ $status == *"is stopped"* ]]; then
    echo "$(date): ERROR - MHA Manager is stopped!" >> $LOG
    # Send an alert
    echo "MHA Manager stopped!" | mail -s "MHA Alert" admin@example.com
fi

# 2. Check replication lag
mysql -h192.168.1.11 -umha -pMHA@Pass123 -e "SHOW SLAVE STATUS\G" | grep "Seconds_Behind_Master" >> $LOG
mysql -h192.168.1.12 -umha -pMHA@Pass123 -e "SHOW SLAVE STATUS\G" | grep "Seconds_Behind_Master" >> $LOG

# 3. Check the VIP (this looks at local interfaces, so it only finds the VIP
#    when the script runs on the current master)
vip_status=$(ip addr show | grep "192.168.1.200")
echo "$(date): VIP status: $vip_status" >> $LOG

echo "$(date): Health check completed" >> $LOG

Schedule it with cron:

# Edit the crontab
crontab -e

# Run the check every 5 minutes
*/5 * * * * /usr/local/bin/mha_health_check.sh

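12.3 Integrating Prometheus Monitoring

# Export MySQL metrics with mysqld_exporter. Run one exporter per MySQL
# instance; this example covers 192.168.1.10 only. Newer exporter releases
# take credentials from a my.cnf file instead of DATA_SOURCE_NAME, so check
# the documentation for the image version you pull.
docker run -d \
  --name mysqld-exporter \
  -p 9104:9104 \
  -e DATA_SOURCE_NAME="mha:MHA@Pass123@(192.168.1.10:3306)/" \
  prom/mysqld-exporter

# Example Prometheus configuration (prometheus.yml); the targets assume an
# exporter is reachable on port 9104 at each address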
scrape_configs:
  - job_name: 'mysql'
    static_configs:
      - targets: ['192.168.1.10:9104', '192.168.1.11:9104', '192.168.1.12:9104']

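13. Best Practices

13.1 Architecture Recommendations

✅ Deploy at least 3 MySQL nodes (1 master + 2 slaves)
✅ Run MHA Manager on its own host, separate from the MySQL nodes it monitors
✅ Use GTID-based replication
✅ Enable semi-synchronous replication to reduce the risk of data loss
✅ Use a VIP so that applications need no reconfiguration after a switch
✅ Take regular backups and test the restore procedure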

13.2 Network Architecture

graph TB
subgraph "Application layer"
  App1[Application server 1]
  App2[Application server 2]
  App3[Application server 3]
end
subgraph "Access layer"
  VIP["VIP 192.168.1.200"]
end
subgraph "Database layer"
  Master["Master 192.168.1.10"]
  Slave1["Slave 1 192.168.1.11"]
  Slave2["Slave 2 192.168.1.12"]
end
subgraph "Management layer"
  MHA["MHA Manager 192.168.1.100"]
end
App1 --> VIP
App2 --> VIP
App3 --> VIP
VIP --> Master
Master --> Slave1
Master --> Slave2
MHA -.monitors.-> Master
MHA -.monitors.-> Slave1
MHA -.monitors.-> Slave2
style VIP fill:#ff9999 
style Master fill:#99ccff 
style MHA fill:#ffcc99

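13.3 Security Hardening

Restrict the MHA user's privileges (grant only what is needed)
Encrypt replication connections with SSL
Configure firewall rules
Audit logs regularly

-- Reduced-privilege MHA user (confirm with masterha_check_repl and a test
-- failover that these grants are sufficient in your environment)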
GRANT SELECT, PROCESS, SUPER, REPLICATION CLIENT ON *.* TO 'mha'@'192.168.1.%';
GRANT ALL ON mysql.* TO 'mha'@'192.168.1.%';

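13.4 Capacity Planning

Disk: enough space to retain at least 3-7 days of binlogs
Memory: a sufficiently large buffer pool (60-70% of physical memory is suggested)
Network: gigabit or faster, with enough bandwidth for replication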
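
The disk and memory guidelines reduce to simple arithmetic. The sketch below is a rough Python estimate, not a sizing tool: the function names are hypothetical, and the 20% headroom default is an assumption added for illustration, not a figure from this guide.

```python
def binlog_disk_gib(daily_binlog_gib, retention_days, headroom=0.2):
    """Disk needed to keep `retention_days` of binlogs, plus a safety margin."""
    return daily_binlog_gib * retention_days * (1 + headroom)


def buffer_pool_range_gib(ram_gib, low=0.60, high=0.70):
    """Suggested InnoDB buffer pool range at 60-70% of physical memory."""
    return ram_gib * low, ram_gib * high


# Example: a server writing about 20 GiB of binlogs per day, keeping 7 days,
# on a host with 64 GiB of RAM.
print(f"binlog disk: {binlog_disk_gib(20, 7):.0f} GiB")
low, high = buffer_pool_range_gib(64)
print(f"buffer pool: {low:.1f}-{high:.1f} GiB")
```

Measure the real daily binlog volume (for example from the file sizes in SHOW BINARY LOGS) before plugging in a number, since write-heavy days can be several times the average.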

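13.5 Regular Drills

Run a manual failover drill once a month
Verify the backup and restore procedure
Test that the VIP moves correctly
Check that monitoring alerts actually fire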

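14. References

14.1 Official Documentation

14.2 Tool Downloads

14.3 Command Quick Reference

Command	Purpose
`masterha_check_ssh`	Check SSH connectivity
`masterha_check_repl`	Check replication status
`masterha_check_status`	Show MHA status
`masterha_manager`	Start MHA Manager
`masterha_stop`	Stop MHA Manager
`masterha_master_switch`	Manual master switch or failover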

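15. Summary

MySQL MHA is a mature and reliable high-availability solution with the following strengths:

🎯 Fast failover: typically completed within 10-30 seconds
🛡️ Data consistency: differential log recovery minimizes data loss
🔧 Easy to deploy: no changes to the existing architecture are required
📊 Low application impact: with a VIP, applications only need to reconnect

Key points:

Make sure passwordless SSH is configured correctly
Use GTID and semi-synchronous replication for better reliability
Use a VIP so that applications need no reconfiguration after a switch
Check replication status and lag regularly
Build thorough monitoring and alerting
Run failover drills regularly

graph TD
A[Deploy MySQL replication] --> B[Install MHA]
B --> C[Configure passwordless SSH]
C --> D[Configure the VIP script]
D --> E[Start MHA Manager]
E --> F[Monitoring and maintenance]
F --> G[Regular drills]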
style A fill:#99ccff 
style E fill:#99ff99 
style G fill:#ffcc99

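With this guide you should be able to deploy and operate MySQL MHA end to end. If you run into problems, consult the official documentation or community resources.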
