Disaster Recovery Using Asynchronous Commit

If all active data centers are destroyed or become unavailable, a passive data center takes over, using a handoff procedure that determines a consistent state of the database. Handoff might lose a small number of the most recently committed transactions.

The decision to abandon the active data center and hand off responsibility to a passive data center can be made by a human administrator or by a user-supplied automated administrator. NuoDB itself cannot make the handoff decision, because the decision requires a more global view of the situation than any NuoDB engine or admin process has. That view includes visibility of connections between application processes and their end users and between application processes and TEs, as well as an understanding of what level of service is required at which times of day, week, or year.

Before any changes are made to the database, the administrator confirms the date and time to which the database state will be reverted. This prevents unexpected loss of a large amount of data.

Handoff consists of the following seven steps:

Confirmation

Confirmation verifies that the passive data center that is going to become the new active data center can revert to a consistent state of the database that is adequately recent.

To begin the confirmation step, the administrator is responsible for shutting down any remaining database processes, starting a new admin process in the passive data center, making sure that the new admin process is aware that all processes in other data centers are gone (evicted), and starting an SM for each archive in the passive data center. See Re-establishing Admin Process Quorum for information about restarting nuoadmin after a disaster. See Asynchronous Commit Example for details.

The administrator then uses the handoff report-timestamp command or the equivalent databases/db_name/reportTimestamp REST API to compute the timestamp of the latest committed transaction present in the state that will be produced by the Resolution step. The administrator must specify the archive IDs of all archives to be used to reconstitute the database. These archives should all be in the same data center, to minimize network latency in the new active data center.

The accuracy of this timestamp depends on the accuracy of the clocks on the TEs, which is the administrator’s responsibility. The timestamp is only approximate, because some commits made on a different TE just before that time could have been lost, but it is the best available estimate. For example,

nuocmd handoff report-timestamp --db-name ... --archive-ids 4 5 6
ReportTimestamp(commits=0,0,0,101,9955, epoch=345, leaders=2 6, timestamp=2021-01-06T13:33:07)

Note that the timestamp is always in the UTC timezone, not the local timezone.
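
When the confirmation step is scripted, the values in the report have to be extracted from the output line. The following is a minimal Python sketch, assuming the output has exactly the shape shown in the example above. The function name parse_report and the regular expression are illustrative and are not part of nuocmd.

```python
import re
from datetime import datetime, timezone

# Assumed output shape, copied from the example in this section.
_REPORT = re.compile(
    r"ReportTimestamp\(commits=(?P<commits>[\d,]+), "
    r"epoch=(?P<epoch>\d+), "
    r"leaders=(?P<leaders>[\d ]+), "
    r"timestamp=(?P<timestamp>[^)]+)\)"
)


def parse_report(line):
    """Parse one ReportTimestamp(...) line into a dict (hypothetical helper)."""
    match = _REPORT.search(line)
    if match is None:
        raise ValueError("unrecognized report line: %r" % line)
    stamp = datetime.strptime(match.group("timestamp"), "%Y-%m-%dT%H:%M:%S")
    return {
        "commits": [int(c) for c in match.group("commits").split(",")],
        "epoch": int(match.group("epoch")),
        "leaders": [int(x) for x in match.group("leaders").split()],
        # The reported timestamp is always UTC, so attach the zone explicitly.
        "timestamp": stamp.replace(tzinfo=timezone.utc),
    }
```

Attaching the UTC zone at parse time avoids accidentally comparing the reported value against a local-time cutoff later on.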

If it is not acceptable to reset the database state back to that date and time, the administrator stops the handoff procedure here and tries something else, such as restoring from the latest-available backup.
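
An automated administrator needs an explicit rule for this decision. The sketch below, in Python, compares the reported timestamp against a cutoff chosen by the operator. The function name is_acceptable and the idea of a single cutoff are assumptions made for illustration; a real policy may depend on more than one timestamp.

```python
from datetime import datetime, timedelta, timezone


def is_acceptable(reported_utc, oldest_acceptable):
    """Return True if the reported state is at least as recent as the cutoff.

    Both arguments must be timezone-aware datetimes. Aware datetimes compare
    correctly across zones, so the cutoff may be given in local time while
    the reported timestamp is in UTC.
    """
    if reported_utc.tzinfo is None or oldest_acceptable.tzinfo is None:
        raise ValueError("both timestamps must be timezone-aware")
    return reported_utc >= oldest_acceptable


# Example: the report says 13:33:07 UTC; the cutoff is 08:00 at UTC-5,
# which is 13:00 UTC, so the reported state is recent enough.
reported = datetime(2021, 1, 6, 13, 33, 7, tzinfo=timezone.utc)
cutoff = datetime(2021, 1, 6, 8, 0, 0, tzinfo=timezone(timedelta(hours=-5)))
proceed = is_acceptable(reported, cutoff)
```

If the check fails, the script should stop before any state is changed, as described above.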

Deprovisioning

Deprovisioning removes the resources (archives, engines, and admins) in the failed active data center from the admin’s model of the database. For example,

nuocmd delete server --server-id ...

Resolution

Resolution returns the database to the most recent consistent state that was captured by observers in the passive data center that is going to become the new active data center. The administrator uses the handoff reset-state command or the equivalent databases/db_name/resetState REST API to reset the state, passing in parameters generated by the confirmation step. For example,

nuocmd check database --db-name ... --check-syncing --num-processes ... --timeout 60
nuocmd handoff reset-state --db-name ... --commits 0 0 0 101 9955 --leaders 2 6 --epoch 345
State successfully reset
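
The commits, leaders, and epoch values passed to reset-state are the ones reported by the confirmation step, with the comma-separated commits list turned into separate arguments. A minimal Python sketch of that mapping follows. The helper name reset_state_argv and the database name mydb are hypothetical; the flags are the ones shown in the example above. The function only builds the argument list and does not run anything.

```python
def reset_state_argv(db_name, commits, leaders, epoch):
    """Build the argument list for 'nuocmd handoff reset-state' (sketch)."""
    argv = ["nuocmd", "handoff", "reset-state", "--db-name", db_name]
    # Each commit and leader value is passed as its own argument.
    argv += ["--commits"] + [str(c) for c in commits]
    argv += ["--leaders"] + [str(n) for n in leaders]
    argv += ["--epoch", str(epoch)]
    return argv


# Values taken from the report-timestamp example; "mydb" is a placeholder.
argv = reset_state_argv("mydb", [0, 0, 0, 101, 9955], [2, 6], 345)
```

Passing the list to a process launcher as separate arguments, rather than joining it into one string, avoids shell quoting issues.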

Promotion

Promotion durably changes the archives in the new active data center from passive to active. The Asynchronous Storage Managers become ordinary Storage Managers. The administrator uses the set archive command or the equivalent archives/modifyObserverStatus REST API. For example,

nuocmd set archive --archive-id 4 --no-observer unpartitioned
Archive successfully modified
nuocmd set archive --archive-id 5 --no-observer unpartitioned
Archive successfully modified
nuocmd set archive --archive-id 6 --no-observer unpartitioned
Archive successfully modified
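
Because the same command is issued once per archive, promotion is easy to script as a loop. The Python sketch below stops at the first failure, since a failed Promotion means restarting handoff from Confirmation. It assumes that nuocmd exits with a non-zero status on failure, which this section does not state; the function name promote and the injectable run parameter are illustrative.

```python
import subprocess


def promote(archive_ids, run=subprocess.run):
    """Promote each archive in turn, using the command shown above (sketch).

    'run' defaults to subprocess.run and can be replaced for dry runs.
    With check=True, subprocess.run raises CalledProcessError on a
    non-zero exit status, which stops the loop at the first failure.
    """
    for archive_id in archive_ids:
        argv = [
            "nuocmd", "set", "archive",
            "--archive-id", str(archive_id),
            "--no-observer", "unpartitioned",
        ]
        run(argv, check=True)


# Dry run: record the commands instead of executing them.
planned = []
promote([4, 5, 6], run=lambda argv, check=False: planned.append(argv))
```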

Reprovisioning

Reprovisioning starts as many Transaction Engines as required to resume serving the application workload together with the Storage Managers promoted in the previous step. The administrator uses the start database command or the equivalent databases/db_name/startplan REST API. For example,

nuocmd start database --db-name ... --incremental --te-server-ids ...
STARTING: ...

Migration

Migration moves the workload to the new active data center and moves administrative activities such as monitoring and backup there as well. Application processes might move to the new active data center, or might stay where they are and just connect to TEs in the new active data center, depending on the application architecture. The administrator is entirely responsible for this step.

Protection

Protection starts a new passive data center to protect against failure of the new active data center. This step is optional. The administrator is entirely responsible for this step.

Recovering from Failures During Handoff

If Confirmation, Resolution, or Promotion fails for any reason, the administrator should address the failure and must restart handoff from Confirmation. For all other steps, the administrator should address the cause of the error and restart just the step that failed. If any of the storage managers taking part in handoff fails before the Promotion step completes successfully, all storage managers taking part in handoff must be stopped and the administrator must restart handoff from Confirmation. When restarting from the beginning, Confirmation must be rerun even if the administrator is satisfied with the timestamp reported during the first handoff attempt, because this step allows asynchronous storage managers to start up and peer without the presence of regular SMs.
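
The restart rules above can be written down as a small lookup, which is useful when an automated administrator drives the procedure. This is a Python sketch of the rules as stated in this section; the names STEPS and restart_point are hypothetical.

```python
# The seven handoff steps, in the order they are described in this section.
STEPS = [
    "Confirmation", "Deprovisioning", "Resolution", "Promotion",
    "Reprovisioning", "Migration", "Protection",
]

# A failure in any of these steps requires restarting from Confirmation.
RESTART_FROM_CONFIRMATION = {"Confirmation", "Resolution", "Promotion"}


def restart_point(failed_step, sm_failed_before_promotion=False):
    """Return the step from which handoff must be restarted (sketch)."""
    if failed_step not in STEPS:
        raise ValueError("unknown step: %r" % failed_step)
    # An SM failure before Promotion completes always forces a full restart
    # (after all SMs taking part in handoff have been stopped).
    if sm_failed_before_promotion:
        return "Confirmation"
    if failed_step in RESTART_FROM_CONFIRMATION:
        return "Confirmation"
    return failed_step
```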

Using the Handoff Database Command

To make the handoff procedure easier to automate, nuocmd includes the handoff database command, which runs several of the handoff steps in a single command.

The handoff database command provides two ways to specify the archives to be used to reconstitute the database. The user can either provide a list of archives with the --archive-ids argument, as with the handoff report-timestamp command, or provide the --all-observer-archive-ids switch to use all observer archives. The --all-observer-archive-ids switch can be used when there is only one passive data center. If the archives are not externally started, the handoff database command starts the SM processes automatically. If the archives are externally started, the user is responsible for starting an SM process on each of the specified archives before running this command.

The handoff database command automatically runs the Confirmation, Resolution, and Promotion steps of handoff. Because the user can no longer confirm the results of the Confirmation step by manually supplying its values to the Resolution step, the handoff database command provides the --oldest-acceptable argument. If it is supplied, the command proceeds only if Confirmation reports a consistent state at least as recent as the provided timestamp.
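
A wrapper script might assemble the handoff database invocation from the options described above. The Python sketch below makes several assumptions that this section does not confirm: that the command accepts --db-name like the other handoff commands, that archive IDs are passed as separate arguments as with report-timestamp, and that the --oldest-acceptable value can be passed through as a string in whatever format the command expects. The helper name handoff_database_argv and the database name mydb are hypothetical.

```python
def handoff_database_argv(db_name, archive_ids=None, oldest_acceptable=None):
    """Build an argument list for 'nuocmd handoff database' (sketch).

    Assumption: --db-name is accepted here as for other handoff commands.
    """
    argv = ["nuocmd", "handoff", "database", "--db-name", db_name]
    if archive_ids:
        # Explicit list of archives to reconstitute the database from.
        argv += ["--archive-ids"] + [str(a) for a in archive_ids]
    else:
        # No list given: use all observer archives.
        argv.append("--all-observer-archive-ids")
    if oldest_acceptable is not None:
        # Passed through unchanged; the expected format is not specified here.
        argv += ["--oldest-acceptable", oldest_acceptable]
    return argv


explicit = handoff_database_argv("mydb", archive_ids=[4, 5, 6])
everything = handoff_database_argv("mydb")
```

Choosing exactly one of the two archive-selection options in the helper keeps the script from passing both by mistake.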

The user is still responsible for the Deprovisioning, Reprovisioning, Migration, and Protection steps of handoff. Deprovisioning can be done either before or after running the handoff database command. Reprovisioning, Migration, and Protection must be run after the handoff database command finishes.