Disaster Recovery
If all active data centers become unavailable due to either an operational failure or an environmental disaster, then the passive data center takes over using a handoff procedure that determines a consistent state of the database.
During the handoff procedure, a few of the most recently committed transactions might be lost.
The following figures illustrate the architecture:
The decision to abandon the active data center and hand off processing control to a passive data center is the responsibility of a human administrator or a user-supplied automated administrator. The handoff decision cannot be made by the NuoDB database itself, because it requires a more global view of the situation than is possessed by the Storage Manager (SM), Transaction Engine (TE), or Admin Process (AP). The decision requires knowledge of the connections between application processes and their end users, as well as between application processes and TEs, and an understanding of the level of service required at any point in time.
Before any change is made to the database, the administrator confirms the date and time to which the database state will be reverted. This prevents unexpected loss of a large amount of data.
The administrator performs handoff through NuoDB Command (nuocmd) or the equivalent REST API requests.
However, the decision to perform handoff remains the customer’s responsibility.
Steps of Handoff Procedure
The handoff procedure consists of the following steps:
- Confirmation
The Confirmation step confirms that the passive data center that is going to become the new active data center can revert to a consistent state of the database that is adequately recent.
The administrator is responsible for:
  - Shutting down all remaining database processes
  - Starting a new AP in the passive data center
  - Ensuring that the new AP is aware that all processes in other data centers are gone (evicted)
  - Starting an SM for each archive in the passive data center to begin the Confirmation step
  - Ensuring that no database processes started outside of handoff join a database started for handoff until the Reprovisioning step
Use the nuocmd handoff report-timestamp command, or the equivalent databases/<dbName>/reportTimestamp REST API request, to compute the timestamp of the latest committed transaction present in the state that the Resolution step will produce. You must specify the archive IDs of all archives to be used to reconstitute the database. These archives should all be in the formerly passive data center, to minimize network latency in the new active data center.
The accuracy of the timestamp in the response depends on the accuracy of the system clocks on the TEs, which is the administrator’s responsibility. The timestamp is the best available estimate, because some commits made on a different TE just before that time could have been lost. For more details, see handoff report-timestamp.
A sample nuocmd handoff report-timestamp command and response are shown below. Note that the timestamp is always in the UTC time zone.
nuocmd handoff report-timestamp --db-name DB_NAME --archive-ids 4 5 6
ReportTimestamp(commits=0,0,0,101,9955, epoch=345, leaders=2 6, timestamp=2021-01-06T13:33:07)
If it is not acceptable to reset the state of the database back to that of timestamp, the administrator stops the handoff procedure and shuts down the database. The administrator may then try to restore the database from the latest available backup.
- Deprovisioning
The Deprovisioning step removes the resources (archives, TEs, SMs, and APs) in the failed active data center from the domain. Use the nuocmd delete server command. For more details, see delete server.
A sample nuocmd delete server command is shown below.
nuocmd delete server --server-id SERVER_ID
- Resolution
The Resolution step returns the database to the most recent consistent state that was captured by Asynchronous Storage Managers (ASMs) in the passive data center. Use the nuocmd handoff reset-state command, or the equivalent databases/<dbName>/resetState REST API request, to reset the state. Pass the input parameters commits, epoch, and leaders generated by the Confirmation step. For more details, see handoff reset-state.
A sample nuocmd handoff reset-state command and response are shown below, preceded by a nuocmd check database command.
nuocmd check database --db-name DB_NAME --check-syncing --num-processes num_processes --timeout 60
nuocmd handoff reset-state --db-name DB_NAME --commits 0 0 0 101 9955 --leaders 2 6 --epoch 345
State successfully reset
- Promotion
The Promotion step changes the archives in the new active data center from passive to active. The ASMs become ordinary SMs. Use the nuocmd set archive command or the equivalent archives/modifyObserverStatus/id REST API request. For more details, see set archive.
Sample nuocmd set archive commands, one per archive, are shown below.
nuocmd set archive --archive-id 4 --active
nuocmd set archive --archive-id 5 --active
nuocmd set archive --archive-id 6 --active
- Reprovisioning
The Reprovisioning step starts as many TEs as required so that they can resume serving the application workload together with the SMs promoted in the Promotion step. Use the nuocmd start database command or the equivalent databases/<dbName>/startplan REST API request. For more details, see start database.
A sample nuocmd start database command and response are shown below.
nuocmd start database --db-name DB_NAME --incremental --te-server-ids TE_SERVER_ID
STARTING: ...
- Migration
The Migration step moves the workload and migrates administrative activities, such as monitoring and backup, to the new active data center. Depending on the application architecture, the application processes might move to the new active data center or stay where they are and connect to TEs in the new active data center. The administrator is entirely responsible for this step.
- Protection
The Protection step starts a new passive data center to protect against failure of the new active data center. This step is optional. The administrator is entirely responsible for this step.
For an example, see Disaster Recovery Using Handoff - Example. For information about restarting NuoDB Admin after a disaster, see Re-establishing Admin Process (AP) Quorum.
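The Resolution step takes its commits, epoch, and leaders arguments from the Confirmation response, and copying them by hand is easy to get wrong. The following minimal Python sketch parses a response line and builds the matching reset-state argument list. The response format is inferred only from the single sample shown in the Confirmation step, and the names ReportTimestamp, parse_report_timestamp, and reset_state_args are hypothetical helpers, not part of nuocmd.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ReportTimestamp:
    commits: list[int]
    epoch: int
    leaders: list[int]
    timestamp: datetime


# Assumption: this pattern is inferred from the one sample response in this
# document; the real output may vary between NuoDB versions.
_PATTERN = re.compile(
    r"ReportTimestamp\(commits=(?P<commits>[\d,]+), epoch=(?P<epoch>\d+), "
    r"leaders=(?P<leaders>[\d ]+), timestamp=(?P<timestamp>[^)]+)\)"
)


def parse_report_timestamp(line: str) -> ReportTimestamp:
    """Parse one report-timestamp response line (hypothetical helper)."""
    match = _PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized report-timestamp response: {line!r}")
    return ReportTimestamp(
        commits=[int(commit) for commit in match["commits"].split(",")],
        epoch=int(match["epoch"]),
        leaders=[int(leader) for leader in match["leaders"].split()],
        # The documented timestamp is always UTC, so attach the zone explicitly.
        timestamp=datetime.fromisoformat(match["timestamp"]).replace(
            tzinfo=timezone.utc
        ),
    )


def reset_state_args(db_name: str, report: ReportTimestamp) -> list[str]:
    """Build the reset-state command line in the form used by the sample."""
    return [
        "nuocmd", "handoff", "reset-state", "--db-name", db_name,
        "--commits", *map(str, report.commits),
        "--leaders", *map(str, report.leaders),
        "--epoch", str(report.epoch),
    ]


if __name__ == "__main__":
    sample = (
        "ReportTimestamp(commits=0,0,0,101,9955, epoch=345, "
        "leaders=2 6, timestamp=2021-01-06T13:33:07)"
    )
    print(" ".join(reset_state_args("DB_NAME", parse_report_timestamp(sample))))
```

The sketch only builds the argument list. Whether to run the command remains the administrator's decision after reviewing the reported timestamp.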
Recovering from Failures During Handoff
If the Confirmation, Resolution, or Promotion step fails for any reason, address the failure and restart handoff from the Confirmation step. If any other step fails, address the cause of the error and restart only the step that failed. If any of the SMs taking part in handoff fails before the Promotion step is completed, stop all SMs taking part in handoff and restart handoff from the Confirmation step. When restarting from the beginning, rerun the Confirmation step even if you are satisfied with the timestamp reported during the first handoff attempt, because this step allows the ASMs to start up and peer without regular SMs present.
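For an automated administrator, the restart rules above can be written as a small decision function. This is a minimal sketch in Python. The function name restart_point and its parameters are hypothetical, and the step names are taken from the handoff procedure in this document.

```python
# Step names as listed in the handoff procedure.
HANDOFF_STEPS = (
    "Confirmation", "Deprovisioning", "Resolution", "Promotion",
    "Reprovisioning", "Migration", "Protection",
)

# A failure in any of these steps forces a restart from Confirmation.
RESTART_FROM_CONFIRMATION = {"Confirmation", "Resolution", "Promotion"}


def restart_point(failed_step: str, sm_failed_before_promotion: bool = False) -> str:
    """Return the step from which handoff should resume after a failure.

    sm_failed_before_promotion: True if an SM taking part in handoff failed
    before the Promotion step completed. In that case all participating SMs
    must also be stopped before restarting (not modeled here).
    """
    if failed_step not in HANDOFF_STEPS:
        raise ValueError(f"unknown handoff step: {failed_step!r}")
    if sm_failed_before_promotion or failed_step in RESTART_FROM_CONFIRMATION:
        return "Confirmation"
    # Any other step is simply retried once the cause has been addressed.
    return failed_step
```

For example, restart_point("Resolution") returns "Confirmation", while restart_point("Reprovisioning") returns "Reprovisioning".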
Using the Handoff Database Command
NuoDB Command (nuocmd) includes the handoff database command, which automates several steps of the handoff procedure.
The nuocmd handoff database command automatically runs the Confirmation, Resolution, and Promotion steps of the handoff procedure.
You are still responsible for the Deprovisioning, Reprovisioning, Migration, and Protection steps of handoff.
The Deprovisioning step can be done either before or after running the nuocmd handoff database command.
The Reprovisioning, Migration, and Protection steps must be run after the nuocmd handoff database command finishes.
There are two ways to specify the archives used to reconstitute the database for the nuocmd handoff database command:
- --archive-ids: Provides an explicit list of archives.
- --all-observer-archive-ids: Uses all the observer archives. You can use --all-observer-archive-ids when there is only one passive data center.
If the archives are not externally started, the handoff database command starts the SM processes automatically.
If the archives are externally started, start an SM process on each of the specified archives before running this command.
Because the handoff database command passes the results of the Confirmation step directly to the Resolution step, you cannot review those results and supply the values manually. Use the --oldest-acceptable argument instead.
If the argument is supplied, the handoff database command proceeds only if the Confirmation step reports a consistent state at least as recent as the --oldest-acceptable value.
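The --oldest-acceptable rule amounts to a timestamp comparison, which an automated administrator might also apply before running the manual steps. The following is a minimal Python sketch of that comparison. It assumes both values are ISO 8601 strings without a zone offset, interpreted as UTC like the sample report-timestamp response. The exact value format accepted by --oldest-acceptable is not stated in this section, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_utc(value: str) -> datetime:
    """Parse a zone-less ISO 8601 string and treat it as UTC (assumption)."""
    return datetime.fromisoformat(value).replace(tzinfo=timezone.utc)


def is_recent_enough(reported: str, oldest_acceptable: str) -> bool:
    """True if the reported state is at least as recent as the cutoff."""
    return parse_utc(reported) >= parse_utc(oldest_acceptable)
```

With the sample timestamp 2021-01-06T13:33:07, a cutoff of 2021-01-06T13:00:00 lets handoff proceed, while a cutoff of 2021-01-06T14:00:00 does not.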