If all active data centers become unavailable due to either an operational failure or an environmental disaster, then the passive data center takes over using a handoff procedure that determines a consistent state of the database.
During the handoff procedure, a few of the most recently committed transactions might be lost.
The following figures illustrate the architecture:
The decision to abandon the active data center and hand off processing control to a passive data center is the responsibility of a human administrator or a user-supplied automated administrator. The handoff decision cannot be made by the NuoDB database itself, because it requires a more global view of the situation than is possessed by the Storage Manager (SM), Transaction Engine (TE), or Admin Process (AP). The decision requires knowledge of the connections between application processes and their end-users as well as between application processes and TEs, and an understanding of the level of service required at any point in time.
Before any change is made to the database, the administrator confirms the date and time to which the database state will be reverted. This prevents unexpected loss of a large amount of data.
The administrator performs handoff through NuoDB Command (nuocmd) or a REST API request.
However, the decision to perform handoff remains the customer’s responsibility.
Steps of Handoff Procedure
The handoff procedure consists of the following steps:
The Confirmation step confirms that the passive data center that is going to become the new active data center can revert to a consistent state of the database that is adequately recent.
The administrator is responsible for:
Shutting down all remaining database processes
Starting a new AP in the passive data center
Ensuring that the new AP is aware that all processes in other data centers are gone (evicted)
Starting an SM for each archive in the passive data center to begin the confirmation step
Ensuring that no database processes started outside of handoff join a database started for handoff until the Reprovisioning step
Use the nuocmd handoff report-timestamp command, or the equivalent databases/<dbName>/reportTimestamp REST API request, to compute the timestamp of the latest committed transaction present in the state that will be produced by the Resolution step. You must specify the archive IDs of all archives to be used to reconstitute the database. These archives should all be in the formerly passive data center, to minimize network latency in the new active data center.
The accuracy of the timestamp in the response depends on the accuracy of the system clocks on the TEs, which is the administrator’s responsibility. The timestamp is the best available estimate, since some commits done on a different TE just before that time could have been lost. For more details, see the reference for the nuocmd handoff report-timestamp command. An example of the nuocmd handoff report-timestamp command and response is shown below. Note that the timestamp is always in the
nuocmd handoff report-timestamp --db-name DB_NAME --archive-ids 4 5 6
ReportTimestamp(commits=0,0,0,101,9955, epoch=345, leaders=2 6, timestamp=2021-01-06T13:33:07)
If it is not acceptable to reset the state of the database back to that of timestamp, the administrator stops the handoff procedure and shuts down the database. The administrator may then try to restore the database from the latest available backup.
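The acceptance check described above can be automated. The following is a minimal sketch, assuming the response always has the same shape as the example output shown above. The function names `parse_report_timestamp` and `is_recent_enough` are hypothetical and are not part of nuocmd.

```python
import re
from datetime import datetime

# Assumption: the response format matches the example shown in this section.
_REPORT = re.compile(
    r"ReportTimestamp\(commits=(?P<commits>[\d,]+), epoch=(?P<epoch>\d+), "
    r"leaders=(?P<leaders>[\d ]+), timestamp=(?P<timestamp>[^)]+)\)"
)


def parse_report_timestamp(line):
    """Split a ReportTimestamp response line into typed fields."""
    match = _REPORT.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized response: {line!r}")
    return {
        "commits": [int(c) for c in match["commits"].split(",")],
        "epoch": int(match["epoch"]),
        "leaders": [int(n) for n in match["leaders"].split()],
        "timestamp": datetime.fromisoformat(match["timestamp"]),
    }


def is_recent_enough(report, oldest_acceptable):
    """Return True if the reported state is at least as recent as the cutoff."""
    return report["timestamp"] >= oldest_acceptable
```

A caller would stop the handoff procedure when `is_recent_enough` returns False, and otherwise carry the parsed commits, epoch, and leaders forward to the Resolution step.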
The Deprovisioning step removes the resources (archives, TEs, SMs, and APs) in the failed active data center from the domain. Use the nuocmd delete server command. For more details, see the reference for the nuocmd delete server command. An example of the nuocmd delete server command is shown below.
nuocmd delete server --server-id SERVER_ID
The Resolution step returns the database to the most recent consistent state that was captured by Asynchronous Storage Managers (ASMs) in the passive data center. Use the nuocmd handoff reset-state command or the equivalent databases/<dbName>/resetState REST API request to reset the state. Pass the commits, epoch, and leaders values generated by the Confirmation step as input parameters. For more details, see the reference for the nuocmd handoff reset-state command. An example of the nuocmd handoff reset-state command and response is shown below.
nuocmd check database --db-name DB_NAME --check-syncing --num-processes num_processes --timeout 60
nuocmd handoff reset-state --db-name DB_NAME --commits 0 0 0 101 9955 --leaders 2 6 --epoch 345
State successfully reset
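The values reported by the Confirmation step map directly onto the reset-state arguments. The sketch below shows one way to assemble the command line from those values, using only the flags that appear in the example above. The helper name `reset_state_command` is hypothetical, and the sketch builds the command without running it.

```python
import shlex


def reset_state_command(db_name, commits, leaders, epoch):
    """Build the argv for `nuocmd handoff reset-state` from Confirmation output."""
    argv = ["nuocmd", "handoff", "reset-state", "--db-name", db_name]
    argv += ["--commits", *map(str, commits)]
    argv += ["--leaders", *map(str, leaders)]
    argv += ["--epoch", str(epoch)]
    return argv


# Values taken from the example ReportTimestamp response in this section.
print(shlex.join(reset_state_command("DB_NAME", [0, 0, 0, 101, 9955], [2, 6], 345)))
```

Building an argument list rather than a single string avoids shell quoting problems if the list is later passed to `subprocess.run`.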
The Promotion step changes the archives in the new active data center from passive to active, so that the ASMs become ordinary SMs. Use the nuocmd set archive command or the equivalent archives/modifyObserverStatus/id REST API request. For more details, see the reference for the nuocmd set archive command. An example of the nuocmd set archive command is shown below.
nuocmd set archive --archive-id 4 --active
nuocmd set archive --archive-id 5 --active
nuocmd set archive --archive-id 6 --active
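Because promotion is one command per archive, it is easy to generate the commands from the same archive ID list used in the Confirmation step. This is a minimal sketch; `promotion_commands` is a hypothetical helper name, and the flags are copied from the example above.

```python
def promotion_commands(archive_ids):
    """Return one `nuocmd set archive ... --active` argv per archive ID."""
    return [
        ["nuocmd", "set", "archive", "--archive-id", str(archive_id), "--active"]
        for archive_id in archive_ids
    ]


# Print the commands for review instead of executing them.
for argv in promotion_commands([4, 5, 6]):
    print(" ".join(argv))
```

To execute rather than print, each list could be passed to `subprocess.run(argv, check=True)`, stopping at the first failure so that the recovery rules for the Promotion step can be applied.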
The Reprovisioning step starts as many TEs as required so that they can resume serving the application workload together with the SMs promoted in the Promotion step. Use the nuocmd start database command or the equivalent databases/<dbName>/startplan REST API request. For more details, see the reference for the nuocmd start database command. An example of the nuocmd start database command and response is shown below.
nuocmd start database --db-name DB_NAME --incremental --te-server-ids TE_SERVER_ID
The Migration step moves the workload and migrates administrative activities, such as monitoring and backup, to the new active data center. Depending on the application architecture, the application processes might move to the new active data center or stay where they are and connect to TEs in the new active data center. The administrator is entirely responsible for this step.
The Protection step starts a new passive data center to protect against failure of the new active data center. This step is optional. The administrator is entirely responsible for this step.
For a worked example, see Disaster Recovery Using Handoff - Example. For information about restarting NuoDB Admin after a disaster, see Re-establishing Admin Process (AP) Quorum.
Recovering from Failures During Handoff
If the Confirmation, Resolution, or Promotion step fails for any reason, the administrator must address the failure and restart handoff from the Confirmation step. If any other step fails, the administrator should address the cause of the error and restart only the step that failed. If any of the SMs taking part in handoff fails before the Promotion step is completed, all SMs taking part in handoff must be stopped and the administrator must restart handoff from the Confirmation step. When restarting from the beginning, rerun the Confirmation step even if the administrator is satisfied with the timestamp reported during the first handoff attempt, because this step allows ASMs to start up and peer without the presence of regular SMs.
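The recovery rules above reduce to a small decision function. The sketch below encodes them for use in an automated administrator. The step names come from this section, while `restart_point` and its parameters are hypothetical names chosen for illustration.

```python
STEPS = [
    "Confirmation", "Deprovisioning", "Resolution", "Promotion",
    "Reprovisioning", "Migration", "Protection",
]

# A failure in any of these steps forces a restart from Confirmation.
RESTART_FROM_CONFIRMATION = {"Confirmation", "Resolution", "Promotion"}


def restart_point(failed_step, sm_failed_before_promotion_done=False):
    """Return the step from which handoff should resume after a failure."""
    if failed_step not in STEPS:
        raise ValueError(f"unknown step: {failed_step!r}")
    # An SM failure before Promotion completes always restarts the procedure
    # (after all participating SMs have been stopped).
    if sm_failed_before_promotion_done or failed_step in RESTART_FROM_CONFIRMATION:
        return "Confirmation"
    # Any other step is simply retried once its cause has been addressed.
    return failed_step
```

For example, a failed Reprovisioning step is retried on its own, whereas a failed Resolution step sends the procedure back to Confirmation.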
Using the Handoff Database Command
NuoDB Command (nuocmd) includes the handoff database command, which automates several steps of the handoff procedure.
The nuocmd handoff database command automatically runs the Confirmation, Resolution, and Promotion steps of the handoff procedure.
You are still responsible for the Deprovisioning, Reprovisioning, Migration, and Protection steps of handoff.
The Deprovisioning step can be done either before or after running the nuocmd handoff database command.
The Reprovisioning, Migration, and Protection steps must be run after the nuocmd handoff database command finishes.
There are two ways to specify the archives used to reconstitute the database for the nuocmd handoff database command:
--archive-ids: To provide a list of archives, use --archive-ids.
--all-observer-archive-ids: To use all the observer archives, use --all-observer-archive-ids. You can use --all-observer-archive-ids when there is only one passive data center.
If the archives are not externally started, the handoff database command starts the SM processes automatically.
If the archives are externally started, start an SM process on each of the specified archives before running this command.
Because the handoff database command runs the Confirmation and Resolution steps together, you cannot review the results of the Confirmation step before its values are supplied to the Resolution step. Use the --oldest-acceptable argument to guard against this.
If the argument is supplied, the handoff database command proceeds only if the Confirmation step reports a consistent state at least as recent as the