Our application relies on MySQL, and we’ve got a simple master-slave setup. That solution, though it works for us now, cannot follow us to AWS. It’s not reliable enough. We have to plan on any given availability zone going down, and taking all of its instances with it. When this happens to us, we want it to be a zero-downtime event.
So we put on our thinking caps, and came up with a number of solutions that could work for us:
- Active-Passive Shared Storage
- Galera or Percona XtraDB Cluster
Many of these solutions were described quite nicely in this slide deck by Oracle.
Which was right for us? (And would it be right for you?)
We aren’t the types to pick a solution out of a hat. We knew that the technology we chose would power our business for years to come. We owed it to our customers to do it right. So what did “right” mean for us?
We came up with almost 20 criteria that would guide our decision, presented in no particular order of importance.
- Be ACID compliant. A significant portion of our data requires this. Other data might live in something like Cassandra.
- SSL encryption between app and nodes. Since we’re dealing with patient health information, the FDA requires that we encrypt all data in motion.
- SSL encryption between nodes. Ditto.
- Servers running in multiple availability zones. In case one availability zone goes down.
- Compatible with our current application. We didn’t want to incorporate new drivers or change how our application interacts with the database.
- Compatible with MySQL. We didn’t want to re-write our existing queries.
- Export all data. We didn’t want a proprietary solution that would make it hard for us to get our data out.
- Have a place to run long-running queries. There needed to be a safe place to run manual, long-running queries.
- No single point of failure. When we’re aiming for high availability, a SPOF sorta goes against the grain.
- Scale reads well. The more customers we get, the more reads we’re going to have to do.
- Scale writes well. Ditto. But just because a solution scales reads well, doesn’t mean the solution scales writes well too.
- Automatically fail over in case of AZ outage. Some might disagree, and want human intervention. But we’re a small team, and we didn’t want our fingers to be the bottleneck when things go down.
- Automatically fail over in case of instance outage. Ditto.
- Automatically join after failover. Ditto ditto.
- Known to run on CentOS or a variant. The less we had to learn about a new OS, or change about our automation to support a new OS, the better.
- Packages exist for the tool. I HATE compiling from source, making my own package, etc. Ick.
- Integrates with existing logging. We liked our logging and monitoring setup, and wanted our solution to play nicely with it.
- Avoid the politics associated with Oracle MySQL. I mean, this wasn’t the most important thing to get in a tizzy about, but if we can support someone other than Oracle in this, we wanted to.
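The two SSL criteria above boil down to stock MySQL TLS settings. As a rough sketch (file paths and section layout are illustrative placeholders, not our actual configuration), encrypting app-to-node traffic looks like this:

```ini
# my.cnf sketch: TLS for client connections (app <-> node).
# Certificate paths below are placeholders, not real locations.
[mysqld]
ssl-ca   = /etc/mysql/certs/ca.pem
ssl-cert = /etc/mysql/certs/server-cert.pem
ssl-key  = /etc/mysql/certs/server-key.pem

[client]
# Clients verify the server against the same CA.
ssl-ca   = /etc/mysql/certs/ca.pem
```

Node-to-node encryption is a separate knob and depends on the replication layer you pick, which is one reason it earned its own line item on our list.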
When all was said and done, we found two solutions that stood out among the rest:
- Galera or Percona XtraDB Cluster
- Active/Passive with Shared Storage
The Cluster solution won out, primarily because we liked the way it scaled better. The decision between Galera and Percona XtraDB was straightforward: we were already using other Percona tools, and decided to stick with one DB vendor.
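To give a flavor of what running Percona XtraDB Cluster involves, here is a minimal sketch of one node’s wsrep settings. The cluster name, node name, and IP addresses are invented placeholders, not our production values:

```ini
# Sketch of a single Percona XtraDB Cluster node's config.
# Names and addresses are illustrative only.
[mysqld]
wsrep_provider         = /usr/lib64/libgalera_smm.so
wsrep_cluster_name     = example-cluster
wsrep_cluster_address  = gcomm://10.0.1.10,10.0.2.10,10.0.3.10
wsrep_node_name        = node1
wsrep_sst_method       = xtrabackup-v2

# Galera requires row-based replication and InnoDB tables.
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
```

Once the nodes are up, running `SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';` on any node reports how many nodes have joined the cluster.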
2 thoughts on “Researching High Availability MySQL: How we chose Percona’s XtraDB”
I agree with you that Galera (and I think Percona XtraDB Cluster is the best available distribution of it) is probably one of the best currently-stable solutions for anything close to “pseudo-synchronous” replication. Facebook has rolled out its own solution, too, based on semi-synchronous replication.
But even before you gave your first requirement, there was a 99.999% chance that NDB wasn’t going to be an option. Let’s see how Oracle reacts with its promised synchronous multi-master replication in MySQL 5.7.
By the way, regarding the Oracle slides: I was suspicious of them officially mentioning Galera, and I was right. Hell will probably have to freeze over before they acknowledge it, which to me is a stupid policy.