Asynchronous Commit

Asynchronous commit allows all database changes to be replicated to a remote passive data center without the database slowing down because of speed-of-light delay in communication between remote locations. In the event of a disaster, a recent, consistent snapshot of the database contents can be recovered and quickly put into service.

The remote data center contains only Admin Processes and Asynchronous Storage Managers.

Disaster recovery

Redundant SMs running on independent physical hosts and keeping their archives and journals on independent physical storage protect against single points of failure. But what about a massive multi-node failure?

If the entire data center is permanently destroyed by fire, hurricane, or earthquake, all redundant SMs' archives will be lost. If the data center is not permanently destroyed but is temporarily unavailable for longer than the acceptable maximum outage time, or provides unacceptably degraded service, the effect is the same.

Without asynchronous commit, the database will have to be restored from a backup to new archives in a new data center. Restoring a large database from a backup can take a long time. Changes made since the most recent backup will be lost.

Asynchronous commit provides quicker disaster recovery with less lost data than relying on backups, at the cost of additional hardware resources to run a remote data center. Changes to the database are replicated as they occur to Asynchronous Storage Managers in a remote passive data center sufficiently distant from the active data center that a single disaster cannot take out both data centers.

Asynchronous Storage Managers are passive and do not participate in NuoDB’s performance-sensitive network protocols. The advantage is that high network latency between remote data centers does not slow down the workload. Performance-sensitive protocols execute entirely inside the active data center, where network latency is low. The disadvantage is that disaster recovery can lose some of the most recently committed transactions.

You may still want periodic backups so you can undo unintended changes to the database by rolling back to a previous day. Unintended changes will be replicated continuously like any other changes.

Handoff after a disaster

If the active data center is destroyed or becomes unavailable, the passive data center takes over, using a "handoff" procedure that determines a consistent state of the database. Because Asynchronous Storage Managers do not participate in NuoDB’s performance-sensitive network protocols, their most recent state might be inconsistent. Handoff includes a reset-state operation that moves the database state backwards in time to the most recent consistent state. This might lose some of the most recent committed transactions, but does recover the most recent consistent state available.

The Asynchronous Storage Managers become ordinary Storage Managers. As many Transaction Engines as required are started and resume serving the application workload.

The following figures illustrate the architecture:

active passive before
Figure 1. Before any disaster occurs, a database with one active data center and two passive data centers looks like this.
active passive after
Figure 2. After recovery from a disaster that destroys the active data center the database looks like this.

The decision to abandon the active data center and hand off responsibility to a passive data center can be taken by a human administrator or by a user-supplied automated administrator. The handoff decision cannot be made by NuoDB itself, because it requires a more global view of the situation than is possessed by any NuoDB engine or Admin Process, including visibility of connections between application processes and their end-users as well as between application processes and Transaction Engines, an understanding of what level of service is required at which times of day, week, or year, and the ability to redirect the application workload to the new active data center. A new passive data center can be added to provide for recovery from the next disaster.

The administrator drives handoff through nuocmd or through the REST API. Handoff can be a scriptable operation that requires no human interaction. However, the decision to perform handoff remains the customer’s responsibility.