Disaster Recovery

If all active data centers become unavailable because of an operational failure or an environmental disaster, a passive data center takes over using a handoff procedure that determines a consistent state of the database.

During the handoff procedure, a few of the most recently committed transactions might be lost.

The following figures illustrate the architecture:

Figure 1. Database with one active data center and two passive data centers

Figure 2. Database during disaster recovery

The decision to abandon the active data center and hand off processing control to a passive data center is the responsibility of a human administrator or a user-supplied automated administrator. The NuoDB database itself cannot make the handoff decision, because the decision requires a more global view of the situation than the Storage Manager (SM), Transaction Engine (TE), or Admin Process (AP) possesses. It requires knowledge of the connections between application processes and their end users, and between application processes and TEs, as well as an understanding of the level of service required at any point in time.

Before any change is made to the database, the administrator confirms the date and time to which the database state will be reverted. This prevents unexpected loss of a large amount of data.

The administrator performs handoff using NuoDB Command (nuocmd) or the equivalent REST API requests. However, the decision to perform handoff remains the customer’s responsibility.

Steps of Handoff Procedure

The handoff procedure consists of the following steps:

Confirmation
The Confirmation step confirms that the passive data center that is going to become the new active data center can revert to a consistent state of the database that is adequately recent.

The administrator is responsible for:
- Shutting down all remaining database processes
- Starting a new AP in the passive data center
- Ensuring that the new AP is aware that all processes in other data centers are gone (evicted)
- Starting an SM for each archive in the passive data center to begin the confirmation step
- Ensuring that no database process started outside of handoff joins a database started for handoff until the Reprovisioning step

Use the nuocmd handoff report-timestamp command, or the equivalent databases/<dbName>/reportTimestamp REST API request, to compute the timestamp of the latest committed transaction present in the state that will be produced by the Resolution step. You must specify the archive IDs of all archives to be used to reconstitute the database. To minimize network latency in the new active data center, these archives should all be in the formerly passive data center.

The accuracy of the timestamp in the response depends on the accuracy of the system clocks on the TEs, which is the administrator’s responsibility. The timestamp is only the best available estimate, because some commits made on a different TE just before that time might have been lost. For more details, see handoff report-timestamp.

A sample nuocmd handoff report-timestamp command and its response are shown below. Note that the timestamp is always in the UTC time zone.
nuocmd handoff report-timestamp --db-name DB_NAME --archive-ids 4 5 6
ReportTimestamp(commits=0,0,0,101,9955, epoch=345, leaders=2 6, timestamp=2021-01-06T13:33:07)
If resetting the database to its state at that timestamp is not acceptable, the administrator stops the handoff procedure and shuts down the database. The administrator may then try to restore the database from the latest available backup.
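The acceptance decision described above can be sketched in Python. This is a minimal illustration, not part of nuocmd: it assumes the response always has the single-line shape of the sample in this section, and the names parse_report_timestamp, is_acceptable, and threshold are hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumption: the response always looks like the sample line in this section.
_PATTERN = re.compile(
    r"ReportTimestamp\(commits=(?P<commits>[\d,]+), "
    r"epoch=(?P<epoch>\d+), "
    r"leaders=(?P<leaders>[\d ]+), "
    r"timestamp=(?P<timestamp>[^)]+)\)"
)


def parse_report_timestamp(line):
    """Split a ReportTimestamp response into its four fields (hypothetical helper)."""
    match = _PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized response: %r" % line)
    return {
        "commits": [int(c) for c in match.group("commits").split(",")],
        "epoch": int(match.group("epoch")),
        "leaders": [int(n) for n in match.group("leaders").split()],
        # The reported timestamp is always in the UTC time zone.
        "timestamp": datetime.fromisoformat(match.group("timestamp")).replace(
            tzinfo=timezone.utc
        ),
    }


def is_acceptable(report, oldest_acceptable):
    """Return True if the reported state is at least as recent as the threshold."""
    return report["timestamp"] >= oldest_acceptable


sample = (
    "ReportTimestamp(commits=0,0,0,101,9955, epoch=345, "
    "leaders=2 6, timestamp=2021-01-06T13:33:07)"
)
report = parse_report_timestamp(sample)
# Hypothetical threshold chosen by the administrator.
threshold = datetime(2021, 1, 6, 13, 0, 0, tzinfo=timezone.utc)
print("proceed" if is_acceptable(report, threshold) else "stop handoff")
```

The parsed commits, epoch, and leaders values are the ones the Resolution step later needs, so keeping them in one structure avoids retyping them by hand.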
Deprovisioning
The Deprovisioning step removes the resources (archives, TEs, SMs, and APs) in the failed active data center from the domain. Use the nuocmd delete server command. For more details, see delete server.

A sample nuocmd delete server command is shown below.
nuocmd delete server --server-id SERVER_ID
Resolution
The Resolution step returns the database to the most recent consistent state that was captured by Asynchronous Storage Managers (ASMs) in the passive data center. Use the nuocmd handoff reset-state command or the equivalent databases/<dbName>/resetState REST API request to reset the state. Pass the input parameters commits, epoch, and leaders generated by the Confirmation step. For more details, see handoff reset-state.

A sample nuocmd handoff reset-state command and its response are shown below. In this sample, a nuocmd check database command is run first.
nuocmd check database --db-name DB_NAME --check-syncing --num-processes num_processes --timeout 60
nuocmd handoff reset-state --db-name DB_NAME --commits 0 0 0 101 9955 --leaders 2 6 --epoch 345
State successfully reset
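As a sketch of how the Confirmation values map onto the reset-state arguments, the following Python helper assembles the command line from the sample in this section. The helper name build_reset_state_command is hypothetical, and the sketch only builds the argument list. It does not run nuocmd.

```python
def build_reset_state_command(db_name, commits, epoch, leaders):
    """Assemble the nuocmd handoff reset-state arguments (hypothetical helper)."""
    args = ["nuocmd", "handoff", "reset-state", "--db-name", db_name]
    # The commits and leaders lists are passed as space-separated values.
    args += ["--commits"] + [str(c) for c in commits]
    args += ["--leaders"] + [str(n) for n in leaders]
    args += ["--epoch", str(epoch)]
    return args


# Values taken from the sample ReportTimestamp response in the Confirmation step.
command = build_reset_state_command("DB_NAME", [0, 0, 0, 101, 9955], 345, [2, 6])
print(" ".join(command))
```

Note that the Confirmation response separates the commits values with commas, whereas the reset-state sample separates them with spaces.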
Promotion
The Promotion step changes the archives in the new active data center from passive to active. The ASMs become ordinary SMs. Use the nuocmd set archive command or the equivalent archives/modifyObserverStatus/id REST API request. For more details, see set archive.

Sample nuocmd set archive commands, one for each archive, are shown below.
nuocmd set archive --archive-id 4 --active
nuocmd set archive --archive-id 5 --active
nuocmd set archive --archive-id 6 --active
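Because promotion is applied to one archive at a time, the commands can be generated from the list of archive IDs used for handoff. The following Python sketch does this for the sample IDs. The function name build_promotion_commands is hypothetical, and the sketch only prints the commands rather than running them.

```python
def build_promotion_commands(archive_ids):
    """Return one nuocmd set archive argument list per archive (hypothetical helper)."""
    return [
        ["nuocmd", "set", "archive", "--archive-id", str(archive_id), "--active"]
        for archive_id in archive_ids
    ]


# Archive IDs from the sample commands in this section.
for args in build_promotion_commands([4, 5, 6]):
    print(" ".join(args))
```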
Reprovisioning
The Reprovisioning step starts as many TEs as required so they can resume serving the application workload together with the SMs promoted in the Promotion step. Use the nuocmd start database command or the equivalent databases/<dbName>/startplan REST API request. For more details, see start database.

A sample nuocmd start database command and response are shown below.
nuocmd start database --db-name DB_NAME --incremental --te-server-ids TE_SERVER_ID
STARTING: ...
Migration

The Migration step moves the workload and migrates administrative activities, such as monitoring and backup, to the new active data center. Depending on the application architecture, the application processes might move to the new active data center or stay where they are and connect to TEs in the new active data center. The administrator is entirely responsible for this step.
Protection

The Protection step starts a new passive data center to protect against failure of the new active data center. This step is optional. The administrator is entirely responsible for this step.

For a worked example, see Disaster Recovery Using Handoff - Example. For information about restarting NuoDB Admin after a disaster, see Re-establishing Admin Process (AP) Quorum.

Recovering from Failures During Handoff

If the Confirmation, Resolution, or Promotion step fails for any reason, address the cause of the failure and restart handoff from the Confirmation step. If any other step fails, address the cause of the error and restart only the step that failed. If any of the SMs taking part in handoff fails before the Promotion step is completed, stop all SMs taking part in handoff and restart handoff from the Confirmation step. When restarting from the beginning, rerun the Confirmation step even if the timestamp reported during the first handoff attempt was satisfactory, because this step allows the ASMs to start up and peer without the presence of regular SMs.
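The restart rules above can be summarized as a small decision function. The following Python sketch is illustrative only. The names STEPS, RESTART_FROM_CONFIRMATION, and restart_point are hypothetical, and the function only reports where to resume. It does not stop SMs or run any command.

```python
# Steps of the handoff procedure, in the order this document describes them.
STEPS = (
    "Confirmation",
    "Deprovisioning",
    "Resolution",
    "Promotion",
    "Reprovisioning",
    "Migration",
    "Protection",
)

# A failure in any of these steps means handoff restarts from Confirmation.
RESTART_FROM_CONFIRMATION = frozenset({"Confirmation", "Resolution", "Promotion"})


def restart_point(failed_step, sm_failed_before_promotion_completed=False):
    """Return the step from which handoff should be resumed (hypothetical helper)."""
    if failed_step not in STEPS:
        raise ValueError("unknown step: %r" % failed_step)
    # An SM failure before Promotion completes always forces a full restart,
    # after all SMs taking part in handoff have been stopped.
    if sm_failed_before_promotion_completed or failed_step in RESTART_FROM_CONFIRMATION:
        return "Confirmation"
    # Any other step is simply rerun once the cause of the error is addressed.
    return failed_step


print(restart_point("Resolution"))
print(restart_point("Reprovisioning"))
print(restart_point("Deprovisioning", sm_failed_before_promotion_completed=True))
```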

Using the Handoff Database Command

NuoDB Command (nuocmd) includes the handoff database command, which automates several steps of the handoff procedure. The nuocmd handoff database command automatically runs the Confirmation, Resolution, and Promotion steps. You are still responsible for the Deprovisioning, Reprovisioning, Migration, and Protection steps. The Deprovisioning step can be done either before or after running the nuocmd handoff database command. The Reprovisioning, Migration, and Protection steps must be run after the nuocmd handoff database command finishes.
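The ordering constraints above can be checked mechanically before a runbook is executed. The following Python sketch validates a planned sequence of actions. The names AUTOMATED, MUST_FOLLOW, COMMAND, and validate_plan are hypothetical, and the string "handoff database" simply marks the point at which the command is run.

```python
# Steps that the handoff database command runs by itself.
AUTOMATED = ("Confirmation", "Resolution", "Promotion")
# Manual steps that must come after the command finishes.
MUST_FOLLOW = ("Reprovisioning", "Migration", "Protection")
# Marker for the point in the plan where the command is run.
COMMAND = "handoff database"


def validate_plan(plan):
    """Return a list of problems found in a planned sequence (hypothetical helper)."""
    if COMMAND not in plan:
        return ["the plan never runs the handoff database command"]
    command_index = plan.index(COMMAND)
    problems = []
    for step in AUTOMATED:
        if step in plan:
            problems.append(step + " is already run by the handoff database command")
    for step in MUST_FOLLOW:
        if step in plan and plan.index(step) < command_index:
            problems.append(step + " must run after the handoff database command")
    # Deprovisioning may appear on either side of the command, so it is not checked.
    return problems


print(validate_plan(["Deprovisioning", "handoff database", "Reprovisioning", "Migration"]))
print(validate_plan(["Reprovisioning", "handoff database", "Deprovisioning"]))
```

An empty list means the plan respects the constraints stated in this section. Protection is optional, so its absence is not reported as a problem.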

The nuocmd handoff database command provides two ways to specify the archives used to reconstitute the database:

--archive-ids: Specifies an explicit list of archives.
--all-observer-archive-ids: Uses all the observer archives. You can use this option when there is only one passive data center.

If the archives are not externally started, the handoff database command will start the SM processes automatically. If the archives are externally started, start an SM process on each of the specified archives before running this command.

Because the handoff database command runs the Confirmation and Resolution steps automatically, you cannot review the results of the Confirmation step before they are supplied to the Resolution step. To guard against resetting the database to an unacceptably old state, use the --oldest-acceptable argument. If the argument is supplied, the handoff database command proceeds only if the Confirmation step reports a consistent state at least as recent as the --oldest-acceptable value.