GCP Solving MySQL HA Replication Lag

Shashank Agarwal
Google Cloud - Community
Dec 30, 2019


TL;DR: This solution focuses on using multi-zonal disks (called regional disks) to achieve durable MySQL HA replication. It is a GCE-based implementation of Cloud SQL's replication technique.

Problem Statement

Public clouds such as GCP start showing replication lag for MySQL replica instances when durability is ON (i.e. sync_binlog = 1) and the average load is >~300 tps (transactions per second). Bare-metal hardware, however, does not show lag at the same ~300 tps load.
NOTE: The 300 TPS limit can vary based on transaction sizes and types, single- or multi-threaded replication, etc.
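
For reference, the durability requirement maps to my.cnf settings along these lines (a minimal sketch; only sync_binlog is named in this article, innodb_flush_log_at_trx_commit is added here as the commonly paired setting):

[mysqld]
# Flush the binary log to disk after every transaction (the durability requirement above)
sync_binlog = 1
# Flush the InnoDB redo log at every commit; an assumption, commonly paired with sync_binlog = 1
innodb_flush_log_at_trx_commit = 1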

Secondary goal — Achieve failover within 30 secs.

Assumptions

MySQL version: 5.7.10+
Replication type: Asynchronous
Durability: Required (i.e. sync_binlog = 1)
Failover time: Under 30 secs

Understanding the Problem

Why is durability needed ?

DBAs need durability while replicating to ensure the consistency and robustness of data on the replica. In addition, it makes failover faster and easier, as no changes need to be made on the replica other than telling it to stop replicating.
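
For example, promoting a durable replica can be as simple as telling it to stop replicating (a sketch; RESET SLAVE ALL is optional and only needed if the node should forget its replication configuration entirely):

# Stop applying relay logs; optionally clear the replication configuration (MySQL 5.7 syntax)
mysql -uroot -p"$mysql_root_password" -e "STOP SLAVE; RESET SLAVE ALL;"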

What causes lag in durable replication ?

The root cause of the replication lag is the latency added by virtualization.

In addition, the lag is amplified by MySQL's ineffectiveness at replicating with multiple threads. MySQL is often unable to distribute the replication load across multiple threads and frequently falls back to a single thread doing most of the work.

Why is the MASTER not lagging or showing high latency?

The master receives traffic over multiple connections (threads), and writes are applied to the database organically. In other words, MySQL acting as a master uses multiple threads effectively, which is not the case while it is replicating.

Unless the master is in a semi-synchronous replication setup, it will likely not show high latency. Even semi-sync mode only ensures that a transaction is committed to at least one replica. Therefore, if you are replicating from an on-prem master to (1) an on-prem replica and (2) a GCP replica, the master will not show additional latency.

Solution to MySQL replication lag

In the above section, we established that replication lag in MySQL happens due to (1) latency introduced by virtualization and (2) the ineffectiveness of the MySQL software while replicating.

To solve this, we can eliminate (2) by using GCP's regional persistent disks. These disks are bi-zonal in a given region, so anything written to them is synchronously written to both zones simultaneously.
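
Creating such a disk is a single gcloud call (a sketch; the name, size, region and replica zones are placeholders):

# Create a regional pd-ssd replicated synchronously across two zones of us-central1
gcloud compute disks create mysql-regional-ssd \
    --type=pd-ssd \
    --size=500GB \
    --region=us-central1 \
    --replica-zones=us-central1-a,us-central1-b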

Deployment diagram

MySQL Master (HA) with regional ssd
  1. A GCP Internal Load Balancer is created to send all traffic to umig1 (unmanaged instance group 1) and, only on failover, to umig2 (a gcloud sketch of this load balancer setup follows this list).
  2. The VMs DB zn1 and DB zn2 have MySQL installed from the same boot image (more details in the next section).
  3. Each VM is added to a separate unmanaged instance group (umig).
  4. A regional SSD is the disk MySQL uses for data and log writes. Only one VM can attach it in read/write mode at a given time.
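
The load balancer piece can be sketched with gcloud as follows (names, region and zones are placeholders; network/subnet and failover-policy flags are omitted for brevity):

# TCP health check on the MySQL port; only the VM currently running mysqld passes it
gcloud compute health-checks create tcp mysql-hc --region=us-central1 --port=3306
# Internal backend service with umig1 as the primary backend and umig2 as the failover backend
gcloud compute backend-services create mysql-ilb-backend \
    --load-balancing-scheme=internal --protocol=TCP --region=us-central1 \
    --health-checks=mysql-hc --health-checks-region=us-central1
gcloud compute backend-services add-backend mysql-ilb-backend --region=us-central1 \
    --instance-group=umig1 --instance-group-zone=us-central1-a
gcloud compute backend-services add-backend mysql-ilb-backend --region=us-central1 \
    --instance-group=umig2 --instance-group-zone=us-central1-b --failover
# Forwarding rule that applications connect to on port 3306
gcloud compute forwarding-rules create mysql-ilb \
    --load-balancing-scheme=internal --region=us-central1 \
    --ports=3306 --backend-service=mysql-ilb-backend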

Details of setting up the regional-disk-based approach

  1. Create a golden boot disk.
    a. Create a new VM and SSH into it.
    b. Install MySQL (and any monitoring software, if needed).
    c. Configure MySQL in the my.cnf file to use the mounted disk path:
    datadir = /mnt/attached-disk/msql/data
    log-bin = /mnt/attached-disk/msql/log/repl

    d. Disable mysql service autostart
    sudo systemctl disable mysqld
    e. Shutdown VM
    f. Create image of boot disk
  2. Create a regional pd-ssd (the gcloud sketch in the previous section shows how).
  3. Create the primary and standby VMs with the metadata “attached_disk_name=<name of regional pd-ssd>” (a gcloud sketch for this step follows the list).
    Note: Allow the VMs full Compute Engine API permissions; failover.sh uses gcloud to attach the disk.
  4. Log in to both of the above VMs and copy the failover.sh script (below) onto them.
  5. Execute failover.sh on the primary instance to start serving MySQL.
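
Steps 1f and 3 can be scripted roughly as follows (a sketch; the image, disk and VM names, zones and machine type are placeholders chosen for illustration):

# Step 1f: create a golden image from the stopped template VM's boot disk
gcloud compute images create mysql-golden-image \
    --source-disk=mysql-template-vm --source-disk-zone=us-central1-a
# Step 3: create the primary and standby VMs from the image, passing the regional disk name as
# metadata and granting the Compute Engine scope that failover.sh needs to attach the disk
gcloud compute instances create db-zn1 \
    --zone=us-central1-a --machine-type=n1-standard-4 --image=mysql-golden-image \
    --metadata=attached_disk_name=mysql-regional-ssd \
    --scopes=https://www.googleapis.com/auth/compute
gcloud compute instances create db-zn2 \
    --zone=us-central1-b --machine-type=n1-standard-4 --image=mysql-golden-image \
    --metadata=attached_disk_name=mysql-regional-ssd \
    --scopes=https://www.googleapis.com/auth/compute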

failover.sh

#!/bin/bash
#### Variables
export other_vm_name="<other instance name primary/standby>"
export other_vm_zone="us-central1-a"
export mysql_root_password="<password>"
#### Variables END
echo $(date) " starting failover #############"
gcloud compute ssh $other_vm_name --zone $other_vm_zone --quiet -- -o ConnectTimeout=10 -t "sudo systemctl stop mysqld"
echo $(date) " stopped current master ##########"
export inst_zone=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor: Google" | awk '{split($0,a,"/"); print a[4]}')
export inst_name=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google")
export attached_disk_name=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/attributes/attached_disk_name" -H "Metadata-Flavor: Google")
gcloud compute instances attach-disk $inst_name --zone "$inst_zone" --disk "$attached_disk_name" --device-name "$attached_disk_name" --mode rw --force-attach --disk-scope "regional"
echo $(date) " attached regional disks ##########"
sudo mount -o discard,defaults /dev/sdb /mnt/attached-disk
echo $(date) " mounted regional disks ##########"
sudo systemctl start mysqld
echo $(date) " started mysql ##########"
echo $(date) " failover complete ##########"
mysql -uroot -p$mysql_root_password -e "show master status"

Executing failover

Execute failover.sh on the standby VM. The load balancer will auto-detect the change and switch traffic over.

Executing fail-back

Execute failover.sh on the primary VM. The load balancer will auto-detect the change and switch traffic back.

Multi Region Deployment

Similarly, a multi-region setup can be deployed so that both DC and DR are highly available. The only difference is that replication from DC to DR has to be non-durable so that the replica can keep up without lag.
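
On the DR side, the non-durable settings might look like this in my.cnf (a sketch; only sync_binlog is discussed in this article, innodb_flush_log_at_trx_commit is added as the usual companion setting):

[mysqld]
# Let the OS decide when to flush the binary log (non-durable, helps the replica keep up)
sync_binlog = 0
# Flush the InnoDB redo log roughly once per second instead of at every commit (assumption)
innodb_flush_log_at_trx_commit = 2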

A few points to consider in this architecture:

  1. Migrating MySQL and failing over instances are the focus of this document; hence the diagram shows all other aspects in rather simple terms.
  2. Each database VM is in a different zone, i.e. ZN[1–4].
  3. The picture shows a Primary and a Standby in both regions. There is no replication between a Primary DB and its Standby.
  4. The standby has the mysqld service stopped and is not attached to the regional SSD(s) until the primary node (in the respective region) fails.
  5. Each load balancer should send traffic to the standby only in case of failover.
  6. The DR Primary (zn3) node has non-durable replication from the DC Primary (zn1).
  7. Replication is unidirectional from Region1 to Region2, since Region2's MySQL is non-durable.
  8. [Optional] Application traffic is routed using a DNS service.
  9. [Optional] Additional read replicas, if configured, will have non-durable replication on zonal SSDs.

Zonal Failover & Failback with Regional PD-SSD

In the event the primary master (DB zn1 Primary) fails, the topology will be updated as below.

Once a failure is detected, a script (failover.sh above) will need to perform the following steps:

  1. Attempt a graceful failover: try stopping the primary database (with a timeout of 10 secs).
  2. Force-attach the regional pd-ssd to the standby VM.
  3. Start the mysql service on the standby.
  4. Run CHANGE MASTER on the DR Primary to continue replication in DR (a sketch follows the next paragraph).

Once steps 1, 2 and 3 are done, the load balancer should detect that the standby instance is healthy and start sending all traffic to it.
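
Step 4 can be done with a CHANGE MASTER TO statement on the DR Primary (a sketch; the host, replication user, password and the use of GTID auto-positioning are assumptions, as the article does not specify the replication coordinates):

# Repoint the DR Primary at the Region1 load balancer address after the zonal failover
mysql -uroot -p"$mysql_root_password" -e "
  STOP SLAVE;
  CHANGE MASTER TO
    MASTER_HOST='10.0.0.10',
    MASTER_USER='repl',
    MASTER_PASSWORD='<replication password>',
    MASTER_AUTO_POSITION=1;
  START SLAVE;"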

Adding Multiple Read Replicas

Multiple read replicas (for MySQL read performance) can be added to the same setup by adding independent MySQL nodes that replicate non-durably.

Alternative Solutions

While working to solve this, we found the following additional approaches. However, all of them work by tweaking MySQL (application-layer) replication settings, and therefore each has some shortcomings.

  1. Semi-synchronous replication: it might solve the replication lag at the cost of increasing the master's latency, which is often not desirable.
  2. Turn off log_slave_updates: replication lag is often elevated when replicas also generate binary logs. Not writing bin logs on the replica speeds up replication to roughly a 700 tps limit (nearly double). This is turned off with log_slave_updates = OFF in my.cnf (see the sketch after this list). The downside is that chained replication is no longer possible, so, for example, any read replicas in DR would also have to replicate from the primary DC master.
  3. Multi-threaded replication: we tried it, and it showed great results with load generated via sysbench. However, with real-world load it did not perform well enough to completely catch up the replication lag.
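
For reference, options 2 and 3 map to my.cnf settings along these lines (a sketch; the two options are independent and shown together only for brevity, the worker count is an example, and slave_parallel_type is not named above):

[mysqld]
# Option 2: do not write replicated transactions to the replica's own binary log
log_slave_updates = OFF
# Option 3: multi-threaded replication in MySQL 5.7 (values are examples)
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 8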

Conclusion

Using regional persistent disks is a great option for building highly available database infrastructure, giving a failover time of under 30 secs.

Thanks for reading. Please reply to this post and let me know if you would like to see Terraform or shell scripts to spin up this entire infrastructure. I can write a part 2 giving complete end-to-end code for this.
