About the Safe Commit Protocol

NuoDB strongly recommends that you use the safe commit protocol.

The safe commit protocol is enabled by default when you start a Transaction Engine.

Note: If you explicitly specified a commit protocol other than 'safe' and you would like to switch to 'safe', either restart the Transaction Engine(s) with the option "--commit safe", or edit the database's capture file, change the commit protocol option to 'safe', and restart the database.

Note: The capture file omits configuration options whose value is the default, so if you never specified a commit protocol, the capture file does not specify one.

See also: Start Process and Restart Database.

Details about the safe commit protocol are described in the sections below.

What the Safe Commit Protocol Does

The safe commit protocol works as follows for successful commits that insert, delete or update data:

  1. The client requests a commit.
  2. The transaction engine sends a pre-commit message only to storage managers that are leader candidates for storage groups modified by the transaction.
  3. Each available storage manager acknowledges the pre-commit.
  4. The transaction engine sends a commit message to all nodes.
Note: If a storage group goes offline, an ongoing transaction that modifies it will not be resolved as committed or failed until that storage group comes back online (or is deleted). The client waiting to commit that transaction will block. If the commit fails after all modified storage groups come back online the error message will be in the format "Transaction NNNNN failed because storage groups X, Y, Z went offline during commit".

Note: As for all commit protocols, the acknowledgment to the requesting client indicates that the transaction is visible to other clients. With safe commit, the acknowledgment to the requesting client also guarantees that the transaction commit is durable.
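The message flow above can be sketched in a few lines of Python. This is a minimal illustration only: the class and method names (StorageManager, acknowledge_precommit, and so on) are hypothetical and are not NuoDB APIs, and a real commit blocks rather than failing fast when a storage group is offline.

```python
# Illustrative sketch of the safe commit message flow.
# All names here are hypothetical, not NuoDB APIs.

class StorageManager:
    def __init__(self, name, groups, online=True):
        self.name = name
        self.groups = set(groups)
        self.online = online

    def is_leader_candidate_for(self, modified_groups):
        # For this sketch, any SM serving a modified storage group
        # counts as a leader candidate for that group.
        return bool(self.groups & set(modified_groups))

    def acknowledge_precommit(self):
        # An offline SM cannot journal the change, so it sends no ack.
        # In the real protocol the client blocks until the storage
        # group comes back online (or is deleted).
        return self.online


class TransactionEngine:
    def __init__(self):
        self.commits_broadcast = 0

    def broadcast_commit(self, nodes):
        # Step 4: the commit message goes to all nodes.
        self.commits_broadcast += 1


def safe_commit(te, storage_managers, modified_groups):
    # Step 2: pre-commit goes only to leader candidates for the
    # storage groups the transaction modified.
    candidates = [sm for sm in storage_managers
                  if sm.is_leader_candidate_for(modified_groups)]
    # Step 3: every candidate must acknowledge the pre-commit.
    if not all(sm.acknowledge_precommit() for sm in candidates):
        return False  # simplification: the real client blocks instead
    # Step 4: broadcast the commit to all nodes.
    te.broadcast_commit(storage_managers)
    return True
```

Note how an SM that is not a leader candidate for any modified storage group never receives the pre-commit, so its availability does not affect the commit.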

Even when a database uses the safe commit protocol, a transaction can still fail in several ways, and those failures can have different outcomes.

The Safe Commit Protocol versus Running NuoDB Check

While safe commit helps to prevent the creation of data inconsistencies (e.g., errors that manifest as the "null descriptor" failure), it does not correct inconsistencies that are already written to cold atoms in the archive.

Running nuochk should find and repair issues such as a missing descriptor.

nuochk can be run on the archive of a stopped storage manager while the database is running, so the initial check does not require bringing the database down. Repairs can be applied to individual tables, so a repair is not necessarily slow.

Durability Under Safe Commit

Failure of a database process (TE or SM) is typically transient, in that the process can be restarted. For example, if a storage manager fails (perhaps due to a power failure), you can usually resolve the failure (restore power) and restart the storage manager.

In rare cases, storage media suffer a permanent failure that prevents the storage manager from being restarted, resulting in the permanent loss of an archive or journal. A permanently failed database process cannot be restarted and so cannot resume serving the database. A failure event is over when the failed database process is replaced with a running one. For example, a permanently-lost-disk failure event is over when the failed storage manager is replaced with a running storage manager.

An archive that permanently fails is no longer available. An unavailable archive can prevent the database from enforcing durability. This happens if you need to perform a cold restart of the database and the failed archive was the only archive that contained one or more updates.

Examples of Safe Commit Behavior

Consider a database configuration with one transaction engine (TE1) and two storage managers (SM1 and SM2). In Scenarios 1 and 2, max-lost-archives is set to 0; in Scenario 3 it is set to 1.

Scenario 1 - Permanent Storage Loss without Cold Restart

In the following scenario, there is a permanent storage loss but a cold restart is not needed. The database continues to run and durability is not violated.

  1. TE1 commits transaction T1.
  2. SM1 crashes but the disk is not lost. SM2 continues running.
  3. TE1 commits transaction T2. This is allowed because when max-lost-archives is set to 0, only one storage manager needs to be running in order to commit transactions that insert, delete, or update data. Only SM2 has T2.
  4. Restart SM1.
  5. SM1 synchronizes with SM2. Both SM1 and SM2 have T2.
  6. SM2 crashes and cannot be restarted.
  7. Start SM3.
  8. SM3 synchronizes with SM1. Both SMs have T2.

Scenario 2 - Permanent Storage Loss with Cold Restart

In this scenario, a sequence of failures (perhaps a power outage) requires a cold restart, complicated by the permanent failure of SM2, which leads to the loss of transaction T2. Durability is violated because T2 was committed on only one archive (SM2), and that archive was permanently lost before another archive could synchronize with it.

  1. TE1 commits transaction T1.
  2. SM1 crashes but the disk is not lost. SM2 continues running.
  3. TE1 commits transaction T2, which modifies the database. Only SM2 has T2.
  4. SM2 crashes and cannot be restarted.
  5. The only transaction engine terminates. The database is down.
  6. Restart SM1. This archive defines the database.
  7. Start a TE. T2 is lost.

Scenario 3 - Permanent Storage Loss with Cold Restart

This scenario follows the same sequence of events as Scenario 2, but with max-lost-archives set to 1 instead of 0:

  1. TE1 commits transaction T1.
  2. SM1 crashes but the disk is not lost. SM2 continues running.
  3. TE1 tries to commit transaction T2, which would modify the database.
  4. NuoDB does not allow this commit because when max-lost-archives is set to 1, two storage managers must be running in order to commit a transaction that inserts, updates, or deletes data.
  5. SM2 crashes and cannot be restarted.
  6. The only transaction engine terminates. The database is down.
  7. Restart SM1. This archive defines the database.
  8. Start a TE. No transactions are lost.

In this scenario, durability is not violated. No transactions were committed when there was only one storage manager running. A cold restart was required and durability was maintained.
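The rule that drives all three scenarios can be stated compactly: a transaction that writes data may commit only while at least max-lost-archives + 1 storage managers are running. A minimal sketch (the function name is illustrative, not a NuoDB API):

```python
def write_commit_allowed(running_sms, max_lost_archives):
    """Sketch of the rule in the scenarios above: a transaction that
    inserts, updates, or deletes data needs max_lost_archives + 1
    running storage managers before it is allowed to commit."""
    return running_sms >= max_lost_archives + 1
```

With max-lost-archives set to 0, T2 commits on a lone SM2 (Scenarios 1 and 2); with it set to 1, the same commit is refused (Scenario 3), which is what preserves durability through the cold restart.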

Comparison with remote:n Commit Protocol

With the safe commit protocol, you can add or remove storage managers without changing the database's configuration and durability continues to be guaranteed. The safe commit protocol always requires a commit acknowledgment from each running storage manager.

With the remote:n commit protocol, you can set n to the number of the database's storage managers and achieve the same durability guarantee as the safe commit protocol. Remember, though, that you must update the value of n whenever the number of storage managers changes.

For example, if your database uses two storage managers and the commit option is set to remote:2, you have the same durability guarantee as if the database were using the safe commit protocol. To continue to enforce the durability guarantee after you add a third storage manager, you need to do the following:

  1. Shut down the database's running transaction engines.
  2. Perform the following two steps in either order:
    • Restart transaction engines with the commit database option set to remote:3.
    • Start the third storage manager.
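The contrast above can be summarized as: safe tracks the topology automatically, while remote:n waits for a fixed n that the administrator must keep in step with the storage manager count. A hypothetical sketch (this helper is illustrative only, not a NuoDB API):

```python
def acks_required(commit_protocol, running_sms):
    """Illustrative comparison: how many storage manager
    acknowledgments each commit protocol waits for."""
    if commit_protocol == "safe":
        # safe always requires an ack from every running SM,
        # so it adapts as SMs are added or removed
        return running_sms
    if commit_protocol.startswith("remote:"):
        # remote:n waits for a fixed n chosen by the administrator
        return int(commit_protocol.split(":", 1)[1])
    raise ValueError("unknown commit protocol: " + commit_protocol)
```

The stale remote:2 setting after adding a third storage manager is exactly the gap that the restart-with-remote:3 procedure above closes.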