[Deprecated] GCP SQL Server using Stateful MIGs for HA and DR

Shashank Agarwal
4 min readNov 29, 2020

--

Update: Stateful MIG does not failover automatically to another Availability Zone (AZ).

Review limitations page.

TLDR;

Deploying traditional database(s) like SQL Server, MySQL, Postgresql etc in a Highly Available setup with minimal cost using GCP.

Introduction

GCP offers stateful managed instance groups which provide an easy and convenient way to deploy a highly available “stateful” service such as database(s). It can be useful when you got to deploy database on a VM and other managed services (like Cloud SQL) does not meet requirements.

NOTE: The same strategy can be applied to other database(s) like Postgresql, MySQL, Oracle, DB2 etc too.

Pros

  • Fully automated node repairs
  • Manual regional HA (across two zones).
  • In-expensive, no standby instances
  • Data replication at GCP Level (via Regional PD), no lag

Cons

  • Longer RTO. For windows ≥~8 mins and ≥~3 mins for linux. Due to time it take for first boot and prepare database instance
  • OS Disk cannot be stateful, hence its recreated in event of failover
  • Startup scripts needed to configure database (attach database, create users, log shipping etc)
  • DR replication is based on log-shipping, which will have lag and potential for data loss of transactions occurred after most recent log backup.

Architecture

Stateful-MIG Setup with HA and DR

As above diagram shows complete solution with HA and DR needs to have any two regions. Primary region can replicate to secondary region via log shipping (other options like db replication, AO/AG can work but they are complex to deploy).

Each region should have a Stateful MIG (S-MIG)setup with regional persistent disk(s) being maintained as stateful device. At any given point in time, there will be exactly one VM in this S-MIG.

In the event of failure, S-MIG’s health check would start failing which will trigger automatic node repair, unless availability zone has outage. S-MIG will delete existing node and create a new node using the os boot disk as supplied with instance template. It will also attach the stateful disk(s) to the newly created vm. Custom startup scripts will need to be create, such that it will prepare database for serving traffic. This could include:

  1. Join active directory (if applicable)
  2. Attach database residing on stateful disks
  3. Create user logins, sql agent jobs (or restore previously backed up system databases).

Deployment

It is important to note that Stateful MIG using Single Zone PD can be created via using GCP Cloud Console. However, at time of writing (Dec 2020), Dual Zone PD (aka Regional PD) based setup cannot be done using the gui. You can use below sequence of gcloud commands to deploy the same.

a. Create regional pd

gcloud compute disks create sql-data-repd \
--type=pd-standard \
--size=200GB \
--region=us-central1 \
--replica-zones=us-central1-c,us-central1-a

b. Create a instance template

gcloud compute instance-templates create sql-smig-template-1 \
--machine-type=e2-standard-4 \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--image=sql-2017-standard-windows-2016-dc-v20201110 \
--image-project=windows-sql-cloud \
--boot-disk-size=50GB \
--boot-disk-type=pd-standard

c. Create health check also respective firewall rules must be created (if not already).

gcloud compute health-checks create tcp sql-port-hc-1 \
--port=1433 \
--proxy-header=NONE \
--check-interval=2 \
--timeout=2 \
--unhealthy-threshold=2 \
--healthy-threshold=2

If needed create firewall rule:

gcloud compute firewall-rules create fw-allow-health-checks \
--action=ALLOW \
--direction=INGRESS \
--source-ranges=35.191.0.0/16,130.211.0.0/22 \
--rules=tcp

d. Create S-MIG with 0 instances. Also, using zones exactly matching Regional PD and Health Check create previously.

gcloud compute instance-groups managed create sql-smig-1 \
--base-instance-name=sql-smig-1 \
--template=sql-smig-template-1 \
--size=0 \
--zones=us-central1-c,us-central1-a \
--instance-redistribution-type=NONE \
--health-check=sql-port-hc-1 \
--initial-delay=400

e. Create an instance inside S-MIG.
Note: This step cannot be done via GUI.

export projectid=$(gcloud config get-value project)gcloud compute instance-groups managed create-instance sql-smig-1 \
--instance=smig-sql-inst-1 \
--stateful-disk device-name=sql-data-repd,auto-delete=never,source=projects/$projectid/regions/us-central1/disks/sql-data-repd,mode=rw \
--region=us-central1

This will create an instance inside S-MIG. Now you can RDP into this VM to format the attached stateful disk and give it drive letter. Formatting needs to be done only first time.
Next, create database with data location on this attached disk.

You will also need to create a startup script with commands to attach database upon automatic node repair. Then add this startup script to an instance template as well.

f. Update the S-MIG once to reconfigure and start automatic node repairs.

gcloud compute instance-groups managed update sql-smig-1 \
--region us-central1

g. Ideally, you should create an internal load balancer (ILB) to communicate with SQL Server. This is to avoid IP changes upon node repairs.

Testing Failover

Simply stop the SQL Server service and in a few seconds you would observe existing instance is replaced by a new one.

Summary

This setup is great if you wish to save on your infrastructure (and maybe licensing) costs while still having a multi zone data center setup. This is because you are not running any standby machines. Also, there is no data loss as data is being written simultaneously in two zone.
Only drawback is, you would need to tolerate recovery time of few minutes.

You can add similar DR too if you want to further increase availability.

Thanks

Thanks for reading my article. I would love to know your thoughts. Please let me know your suggestions and how can I improve my writing.

--

--