This is a brain dump right now and will need to be prettied up before being presented to something like the mailing list, but I'm hoping interested parties can pick it apart at this stage so it can be firmed up.
Also, this document only discusses Open Beta and Production. During Closed Beta we have been doing manual backups on an irregular (every 2-3 days) basis, knowing that this isn't fully safe.
In general, the rules are:
- Full backups of each database cluster on a weekly basis.
- Incremental backups done by storing binlogs as they become available.
- All backups stored off-site on a server hosted by a secondary company.
- All backups retained for 30 days.
- Up to several seconds of data loss accepted in extreme situations.
Given our current operating budget and the kind of site we're dealing with, we are aiming for roughly 99.95% operational availability. I.e., we accept situations where we might lose a few seconds of data if a master dies hard.
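The cadence above might look something like this as a crontab on each database host. The times, script names, and paths are placeholders, not a settled layout:

```cron
# Weekly full backup, Sunday 04:00 (a low-load window); placeholder script.
0 4 * * 0  /usr/local/bin/full-backup.sh
# Every five minutes, ship any binlogs the master has finished writing
# to the off-site backup server; placeholder script.
*/5 * * * *  /usr/local/bin/archive-binlogs.sh
```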
Slicehost doesn't really provide plans with a lot of disk space. I intend to rent a disk-heavy server from a secondary service (Serverbeach probably) which will then store backups.
I feel that offsite backups are a requirement of a production service. Our particular configuration (using virtual servers for open beta) means that this is going to be a sizable cost in transfer. Backups are large, and we will be moving them frequently.
There is also the concern that we will be pegging our internal bandwidth limits. We had 40Mbps on our internal interfaces during closed beta; this might be higher in our new cluster. Will update when I know.
We will store at least 30 days of backups, giving us the ability to restore the site to any point in the past 30 days.
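Enforcing the 30-day window on the backup server is mostly a pruning job. A minimal sketch, assuming backups live under one directory tree on the off-site box (the path is hypothetical):

```shell
#!/bin/sh
# prune_old_backups DIR DAYS: delete backup files older than DAYS days.
prune_old_backups() {
    dir="$1"; days="$2"
    # -mtime +N matches files last modified more than N days ago.
    find "$dir" -type f -mtime "+$days" -print -delete
}

# Example: enforce the 30-day window on a hypothetical backup root.
# prune_old_backups /srv/backups 30
```

Run daily from cron on the backup server; anything a restore might need inside the window stays put, everything older goes.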
Given that MogileFS has good built-in redundancy onto multiple servers, I do not feel it is necessary to back these files up externally at this time. When we move to a colocated environment and don't have to worry about backup transfer counting against our quota and the high cost of disk storage in virtual servers, we can look at backing up the MogileFS cluster.
The global database is a master/slave configuration. There are one or more slaves depending on load. The slave database will be downed and backed up completely on a weekly basis during a period of low site load. The master will have binlogs saved to the backup server as they become fully written.
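A sketch of the binlog-archiving half, assuming we rsync to the off-site box and treat a binlog as safe to copy once the master has rotated past it. Paths, the binlog naming pattern, and the destination are all placeholders:

```shell
#!/bin/sh
# Ship finished binlogs off-site. A binlog is "finished" once the master
# has rotated to a newer one, so we copy everything except the file the
# master is currently writing.
archive_binlogs() {
    binlog_dir="$1"   # e.g. /var/lib/mysql
    dest="$2"         # e.g. backup@offsite:/srv/backups/global/binlogs/
    # Sorted by name, the last binlog is the one still being written; drop it.
    finished=$(ls "$binlog_dir"/mysql-bin.[0-9]* 2>/dev/null | sort | sed '$d')
    # $finished is intentionally unquoted: one argument per file.
    [ -n "$finished" ] && rsync -a $finished "$dest"
}
```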
User clusters are configured in a master/master configuration, though we will only have one machine active at a time. Backups will be done the same way as for the global database: binlogs archived off of the active master, and the inactive master downed and backed up weekly.
Note to self: when doing backups, we will need to stop the SQL_THREAD, but leave the IO_THREAD running. This ensures that we continue to copy down writes from the master, but gives us a stable target (InnoDB files) to back up. If we stopped the IO_THREAD instead, we would risk losing all writes that happen while the slave is disconnected, should the master fail during that window.
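That procedure sketched out, with the datadir path and the overridable MYSQL client variable as placeholders for whatever we settle on:

```shell
#!/bin/sh
# MYSQL is overridable so this can be dry-run; defaults to the real client.
MYSQL="${MYSQL:-mysql}"

# Weekly full backup of a slave / inactive master: stop only the SQL
# thread so the IO thread keeps pulling binlogs from the master while
# the InnoDB files hold still.
backup_slave() {
    datadir="$1"; out="$2"
    $MYSQL -e "STOP SLAVE SQL_THREAD;"
    # Relay logs keep accumulating, nothing is applied: stable files.
    tar czf "$out" -C "$datadir" .
    $MYSQL -e "START SLAVE SQL_THREAD;"
}
```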
There are several failure cases and potential recovery scenarios, discussed below.
Slave / Inactive Master Failure
In the case that a slave or inactive master fails, a new virtual server will be spun up. The most recent backup of the newly dead machine will be restored, and replication can be started to pull it up to date.
There's an assumption in this section that the failure is catastrophic for the slave. There are other failure cases, but all of them have the same treatment in the case of virtual servers: spin up a new slave and discard the old one. When we move to actually having hardware, we will need to come up with procedures for the different failure modes.
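The restore-and-catch-up step might look like the sketch below. The hostname, log file, and position are placeholders; in practice the coordinates come from the metadata recorded alongside the full backup:

```shell
#!/bin/sh
MYSQL="${MYSQL:-mysql}"

# Rebuild a dead slave from the latest full backup, then point it at the
# master so replication replays everything since that backup.
restore_slave() {
    backup="$1"; datadir="$2"; master="$3"
    tar xzf "$backup" -C "$datadir"
    # Placeholder coordinates; use the file/position saved with the backup.
    $MYSQL -e "CHANGE MASTER TO
        MASTER_HOST='$master',
        MASTER_LOG_FILE='mysql-bin.000123',
        MASTER_LOG_POS=4;
      START SLAVE;"
}
```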
No site downtime will be experienced. There is a slight chance that, if a global slave dies and we do not have enough capacity in the remaining slaves (if any), the site will become sluggish and start queueing requests.
There is no data loss potential with this failure.
Master Failure
If a master (the global database master or a user cluster's active master) fails, there will be site downtime: either the entire site (global master down) or just the content from one cluster will be unavailable until an administrator steps in to resolve the problem.
The recovery process depends on which master was lost. Read on.
Global Master Failure
The site is hard down while the global master is offline. The most likely recovery will be to promote an existing global slave to be the new master and build a new slave. This is not going to be a short process; the site will be down for a few hours.
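The promotion itself is a short sequence, though picking the most caught-up slave and repointing the site are the slow parts. A sketch, assuming standard MySQL master/slave replication (the MYSQL override is a dry-run placeholder):

```shell
#!/bin/sh
MYSQL="${MYSQL:-mysql}"

# Promote a global slave to master after the old master is a total loss.
promote_slave() {
    # Let the slave finish applying whatever relay log it already has;
    # wait for "Slave has read all relay log" before the next step.
    $MYSQL -e "STOP SLAVE IO_THREAD;"
    # Drop its slave identity and start with a clean binlog as master.
    $MYSQL -e "STOP SLAVE; RESET SLAVE; RESET MASTER;"
}
```

After this, the remaining slaves get repointed at the new master and the site config updated, which is where the hours go.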
Potential for data loss: any transactions that were uncommitted on the master will be lost, as will any writes that had not yet replicated. (This assumes the master is a total loss and we cannot retrieve any data from it.)
If we manage load such that replication lag is never more than a few seconds, then at most we will lose the writes from those few seconds. While this is not good, it is a known risk with master/slave setups. It is currently unknown how likely we are to face a total-loss scenario on a Slicehost server.
User Cluster Master Failure
Data for users on that cluster is unavailable while its active master is down; other users will still be able to use the site. Recovery is fairly immediate: we can flip the active-database bit to the inactive master and have that cluster back online in minutes.
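Flipping the bit amounts to one update in the global database. The table and column names below are hypothetical stand-ins for however we end up storing the active flag:

```shell
#!/bin/sh
MYSQL="${MYSQL:-mysql}"

# Fail a user cluster over to its standby master by marking the standby
# as the active side in the global database. Schema names are made up.
activate_standby() {
    cluster_id="$1"; standby_host="$2"
    $MYSQL global -e "UPDATE clusters
                         SET active_host = '$standby_host'
                       WHERE cluster_id = $cluster_id;"
}
```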
The data loss analysis is the same. Transactions that didn't make it to the replication slave (inactive master) will be lost. Replication lag needs to be monitored and kept short.
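Since the data-loss window is bounded by replication lag, a cron'd check on Seconds_Behind_Master is the obvious monitor. A sketch, with the alert threshold and the MYSQL override as placeholders:

```shell
#!/bin/sh
MYSQL="${MYSQL:-mysql}"
MAX_LAG="${MAX_LAG:-5}"   # seconds of lag we are willing to tolerate

# Alert if the slave is lagging too far, or if replication is broken
# entirely (Seconds_Behind_Master reports NULL in that case).
check_lag() {
    lag=$($MYSQL -e "SHOW SLAVE STATUS\G" \
          | awk '/Seconds_Behind_Master/ {print $2}')
    if [ -z "$lag" ] || [ "$lag" = "NULL" ]; then
        echo "ALERT: replication broken or not running"
    elif [ "$lag" -gt "$MAX_LAG" ]; then
        echo "ALERT: slave is ${lag}s behind"
    fi
}
```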