Software and operations engineers are a passionate bunch. If you were to listen in on a group conversation, you’d hear a lot of strong opinions. People caring so much about the jobs they do and the products they build is ultimately a great thing. But if there’s one thing we can all agree on, it’s that backups should be tested and stored off site.
Ensuring backups are reliable is a crucial part of the job of any DevOps team. The best-case scenario is that you’ll never need to restore from backup. The worst-case scenario is that you will—and that you will find out your backups have failed or that you haven’t tested a restore.
At Sprout Social, we use a number of clustering technologies to provide resilience to hardware and software failures, but in the unlikely event we lose an entire cluster at once, we want to be sure we can bring it up again with minimal fuss. Let’s take a look behind the curtain at our backup strategy for MySQL databases.
What We’re Not Doing
Many junior database administrators may start out with a simple approach: writing a script that loops through each table in a given database and performs a mysqldump for that table. In order to keep the data consistent, they’ll probably run this on a slave replica, stopping replication before the process starts and starting it again once it’s finished. There are a some issues with this approach, but before we dive into those, let’s take a moment to look at how replication works.
MySQL replication is the go-to solution for whenever your workload grows too large for a single server. In a typical cluster of MySQL database servers, there is one master and any number of slaves. Any time data is changed (e.g., you publish a post or receive a message in your inbox) that change is sent to the master. The master then hands that change off to all the slaves, basically saying, “Hey, I made this change; you should make it too.” Anytime you want to look at data that hasn’t changed that query gets directed to a slave.
If you’ve been paying close attention, you’ve probably guessed that in order for replication to effectively distribute work across the cluster, you need to have many more read-only queries than queries that make changes, and you’d be right. There’s really no such thing as a typical workload, but most web applications have upwards of 60% read-only queries.
Problems With the Simple Approach
Remember our junior database administrator? Her solution does everything we want it to, but it has some problems.
The first is resource usage. It’s important for backups to be consistent, a point-in-time picture of your data. If two tables are backed up at different times, even slightly, that can cause big problems later on when we go to restore. Our junior database administrator got about that problem by stopping replication, effectively pausing time as far as our database is concerned, but that leaves us with a slave server that we can’t otherwise use for servicing normal work while the backup is running. One solution to this is a dedicated server for backups, but then we have a server that’s just sitting idle except for when backups are running.
The second issue with the simple approach is time. Remember how I said we’d do a mysqldump of each table? mysqldump turns binary computer code inside the database into SQL (Structured Query Language) statements. Unfortunately, converting from binary to SQL takes a long time. It takes even longer to run the same process in reverse, should we ever need to do a restore. So why would our junior database administrator take this approach? Because she could read the output to verify the backup is doing what it’s supposed to do. This might be OK for small amounts of data, but clearly, such manual verification of backups won’t scale.
A DevOps Approach
Our junior database administrator may also have chosen mysqldump because it’s a tool she’s familiar with. There are better tools for this task, but the structure of traditional technology organizations isn’t always conducive to knowledge sharing across teams, so a database administrator, especially a junior one, might not know about or have access to them.
At Sprout, we use a DevOps methodology that is designed to punch holes through—if not completely obliterate—knowledge silos. Let’s see if we can apply this to our backup problem.
Beneath the database itself is a disk with a filesystem. Modern filesystems have the ability to create snapshots. A snapshot is a copy of all the data on the filesystem at a given point in time. While a snapshot exists, we incur some extra overhead, so we don’t want them to exist for too long. But because the filesystem operates at a very low level, we can snapshot our database’s information without MySQL ever being aware we’ve done it.
A perfectly consistent point in time copy of our data while the database continues to run? That sounds like exactly what we need. Our process now looks like this:
- Create a filesytem snapshot.
- Do something with the files.
- Remove the snapshot.
- We’re done (but we do need to zoom in on Step 2 a bit).
Remember when I said backups should be stored off site? It’s also true that backups should be stored close by (we don’t want to have to wait for them to be delivered when we need them.) What we really want is a both/and approach. Since “close by” usually means “fast,” we can do that while our snapshot is active. Afterwards, we’ll have a backup that we copy off site too.
Testing the Restore
So far, our process has four steps:
- Create a snapshot.
- Create a local backup.
- Remove the snapshot.
- Copy the local backup somewhere off site.
The part that’s missing is making sure we can restore when we need to. We need to add another step:
5. Test the backup we just created.
Our team at Sprout likes storing backups in Amazon S3. It’s reliable, which is vital for backups, and it supports automatic encryption and fine-grained access control, which keeps our security team happy. The first step in testing a restore is to get the file back out of S3. We do this with an Amazon EC2 server for the best performance. There is a challenge here in that work has to be coordinated between the instances performing the backups and the instances performing the restore.
To solve this challenge, we created a messaging system between the database servers and the EC2 server. As soon as a backup is created and uploaded, we send a message to the EC2 server. The EC2 server will then download the backup we just uploaded and restore it. If anything goes wrong at any point during the entire process, an alert is sent to our Operations Engineers.
We’ve achieved the holy grail of database backup solutions, and it’s completely automatic. Sleep better knowing your data is safe, even in the worst of cases.
Allie Richards: As Sprout Social's DevOps Manager, Allie Richards is responsible for keeping data safe and service available. She is also Sprout's crazy cat lady in residence.