About the Safe Commit Protocol
The safe commit protocol coordinates messages to the journal with transaction commits.
When you start a Transaction Engine (TE), the safe commit protocol is enabled by default.
NuoDB strongly recommends that you continue to use the safe commit protocol.
The commit protocol is set using the NuoDB Command tool.
To set the database default commit protocol at database create time, use nuocmd create database with the --default-options commit <value> option.
If you have changed the database default commit protocol and would like to revert to using the safe commit protocol:
- Shut down the database.
- Run nuocmd update database-options with the --default-options commit safe option.
- Restart the database using nuocmd start database.
Alternatively, to avoid application downtime:
- Run nuocmd update database-options with the --default-options commit safe option.
- Restart the database processes in a rolling restart using nuocmd start process.
For reference on how to use these commands, see nuocmd create database, nuocmd update database-options, nuocmd start database, and nuocmd start process.
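The two revert procedures above can be sketched as an ordered plan. The snippet below is a minimal illustration, not an official script: it only builds the command lines as text and executes nothing. The --db-name option and the database name mydb are assumptions that do not come from this page, and the shutdown step is left as a plain description because this page does not name a command for it. Check every command against the nuocmd reference before use.

```python
import shlex

DB_NAME = "mydb"  # hypothetical database name, used only for illustration


def set_commit_command(db_name, protocol="safe"):
    # "--default-options commit <value>" is taken from this page.
    # "--db-name" is an assumption; verify it in the nuocmd reference.
    return ["nuocmd", "update", "database-options",
            "--db-name", db_name,
            "--default-options", "commit", protocol]


def revert_plan(db_name, rolling=False):
    """Return the ordered steps for reverting to the safe commit protocol."""
    update = shlex.join(set_commit_command(db_name))
    if rolling:
        # No downtime: update the option, then restart processes in turn.
        return [update, "rolling restart using: nuocmd start process"]
    return ["shut down the database",
            update,
            "restart using: nuocmd start database"]


for step in revert_plan(DB_NAME):
    print(step)
```

Calling revert_plan(DB_NAME, rolling=True) returns the two-step variant that avoids application downtime.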
What the Safe Commit Protocol Does
For commits that insert, delete or update data, the safe commit protocol works as follows:
- The client requests a commit.
- The TE sends a pre-commit message only to the Storage Managers (SMs) that are leader candidates for storage groups modified by the transaction.
- Each available SM acknowledges the pre-commit after making it durable, so it cannot be lost if the SM, or the host it is running on, crashes.
- The TE sends a commit message to all nodes.
If a storage group goes offline, an ongoing transaction that modifies it is not resolved until that storage group comes back online (or is deleted). If the commit fails after all modified storage groups come back online, the following error message is returned:
Transaction NNNNN failed because storage groups X, Y, Z went offline during commit.
Even though a database uses the safe commit protocol, transactions may still be unsuccessful. However, safe commit guarantees that committed transactions are durable.
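The message flow described in this section can be illustrated with a toy model. This is a sketch under stated assumptions, not NuoDB's implementation: the class and function names are hypothetical, the journal is a plain list standing in for a durable write, and the rule that every modified storage group needs at least one acknowledgement before the commit resolves is a simplification inferred from the note about offline storage groups.

```python
from dataclasses import dataclass, field


@dataclass
class StorageManager:
    name: str
    storage_groups: set            # groups this SM is a leader candidate for
    available: bool = True
    journal: list = field(default_factory=list)

    def pre_commit(self, txn_id):
        """Record the pre-commit durably, then acknowledge it."""
        if not self.available:
            return False               # an unavailable SM cannot acknowledge
        self.journal.append(txn_id)    # stands in for a durable journal write
        return True


def safe_commit(txn_id, modified_groups, sms):
    """Toy model of the steps above; returns (status, acknowledging SMs)."""
    # Pre-commit goes only to leader candidates for the modified groups.
    targets = [sm for sm in sms if sm.storage_groups & modified_groups]
    # Each available SM acknowledges after making the pre-commit durable.
    acked = [sm for sm in targets if sm.pre_commit(txn_id)]
    # Assumption: every modified group needs at least one acknowledgement.
    covered = set()
    for sm in acked:
        covered |= sm.storage_groups & modified_groups
    status = "committed" if covered == modified_groups else "pending"
    return status, [sm.name for sm in acked]


sms = [
    StorageManager("SM1", {"SG1"}),
    StorageManager("SM2", {"SG1", "SG2"}),
    StorageManager("SM3", {"SG3"}),
]
print(safe_commit(1, {"SG1"}, sms))   # SM3 is never contacted
sms[1].available = False
print(safe_commit(2, {"SG2"}, sms))   # SG2 is offline, so the commit stays pending
```

In this model a transaction that touches an offline storage group stays pending instead of being reported as committed, which mirrors the behavior described in the note above.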
Durability Under Safe Commit
Failure of a database process (TE or SM) is typically transient in that it can be restarted. For example, if an SM fails (perhaps due to power failure), you can usually resolve the failure (restore power) and restart the SM. In rare cases, storage media can suffer permanent failure that prevents the SM from being restarted. This results in the permanent loss of an archive or journal. A permanently failed database process cannot be restarted and so cannot resume serving the database. In the event that a database process has permanently failed, the solution is to replace the failed process with a running database process, for example, replacing a failed SM with a running SM.
An archive that permanently fails is no longer available. An unavailable archive may prevent a database from enforcing durability. This can happen if you need to perform a cold restart of the database and the failed archive is the only archive that contains one or more updates.
Examples of Safe Commit Behavior
Consider the following database configuration:
- The --commit option is set to safe.
- The --max-lost-archives option remains set to the default value of 0.
- One Transaction Engine (TE1) is running.
- Two Storage Managers (SM1 and SM2) are running.
Scenario 1 - Permanent Storage Loss without Cold Restart
In this scenario, there is a permanent storage loss but a cold restart is not needed. The database continues to run and durability is not violated.
- TE1 commits Transaction 1 (T1).
- SM1 crashes but the disk is not lost. SM2 continues running.
- TE1 commits Transaction 2 (T2).
  This is allowed because when max-lost-archives is set to 0, only one SM needs to be running for insert, delete or update transactions to be committed. At this time, only SM2 has T2.
- SM1 is restarted by the DBA.
- SM1 synchronizes with SM2.
  At this time, both SM1 and SM2 have T2.
- SM2 crashes and cannot be restarted.
- SM3 is started by the DBA.
- SM3 synchronizes with SM1.
  At this time, both SM1 and SM3 have T2.
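Scenario 1 can be replayed with a small model that tracks which archive holds which transaction. This is an illustrative sketch only: the SM class and the commit and sync helpers are hypothetical, and synchronization is reduced to copying a set of transaction names.

```python
class SM:
    def __init__(self, name):
        self.name = name
        self.running = True
        self.archive = set()     # transactions stored durably on this SM


def commit(txn, sms):
    """Commit to every running SM (max-lost-archives is 0, so one is enough)."""
    running = [sm for sm in sms if sm.running]
    if not running:
        raise RuntimeError("no SM is running, commit not possible")
    for sm in running:
        sm.archive.add(txn)


def sync(target, source):
    """A started or restarted SM copies what it is missing from a running SM."""
    target.running = True
    target.archive |= source.archive


sm1, sm2 = SM("SM1"), SM("SM2")
commit("T1", [sm1, sm2])
sm1.running = False          # SM1 crashes; its disk survives
commit("T2", [sm1, sm2])     # only SM2 receives T2
sync(sm1, sm2)               # SM1 is restarted and synchronizes with SM2
sm2.running = False          # SM2 crashes ...
sm2.archive = None           # ... and its storage is permanently lost
sm3 = SM("SM3")
sync(sm3, sm1)               # SM3 is started and synchronizes with SM1
print(sorted(sm1.archive), sorted(sm3.archive))
```

At the end of the replay, SM1 and SM3 each hold T1 and T2, so the permanent loss of SM2 does not lose a committed transaction.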
Scenario 2 - Permanent Storage Loss with Cold Restart
In this scenario, a sequence of failures (perhaps due to a power outage) has occurred. The permanent failure of SM2 has resulted in the loss of T2; therefore, a cold restart is required. Durability is violated because T2 was committed on only one archive (SM2) and that archive was permanently lost before another archive could synchronize with it.
- TE1 commits T1.
- SM1 crashes but the disk is not lost. SM2 continues running.
- TE1 commits transaction T2, which modifies the database.
  At this time, only SM2 has T2.
- SM2 crashes and cannot be restarted.
  As the only TE has failed, the database is down.
- SM1 is restarted by the DBA.
  SM1’s archive defines the database.
- A TE is started by the DBA.
  T2 is lost.
Scenario 3 - Permanent Storage Loss with Cold Restart (and max-lost-archives set to 1)
The details of this scenario are the same as for scenario 2, but the --max-lost-archives option has been set to 1 instead of 0:
- TE1 commits T1.
- SM1 crashes but the disk is not lost. SM2 continues running.
- TE1 attempts to commit T2, which would modify the database.
  NuoDB does not allow this commit because when the --max-lost-archives option is set to 1, two SMs must be running in order to commit a transaction that updates, inserts or deletes data.
- SM2 crashes and cannot be restarted.
  As the only TE has failed, the database is down.
- SM1 is restarted by the DBA.
  This archive defines the database.
- A TE is started by the DBA.
  No transactions are lost.
In this scenario, durability is not violated. No transactions were committed when there was only one SM running. A cold restart was required and durability was maintained.
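The difference between scenarios 2 and 3 can be shown by replaying the same failure sequence with both settings. The sketch below is a simplified model, not NuoDB code. It assumes that a commit needs max-lost-archives + 1 running SMs; that rule is a generalization of the two cases described above (one SM when the option is 0, two SMs when it is 1) and is not stated on this page. The function name replay is hypothetical.

```python
def replay(max_lost_archives):
    """Replay the failure sequence; return (committed, lost) transactions."""
    archives = {"SM1": set(), "SM2": set()}
    running = {"SM1", "SM2"}
    committed = []

    def commit(txn):
        # Assumption: a commit needs max_lost_archives + 1 running SMs.
        if len(running) >= max_lost_archives + 1:
            for name in running:
                archives[name].add(txn)
            committed.append(txn)

    commit("T1")
    running.discard("SM1")      # SM1 crashes; its disk survives
    commit("T2")                # allowed only if one running SM is enough
    running.discard("SM2")      # SM2 crashes ...
    del archives["SM2"]         # ... and its archive is permanently lost
    surviving = archives["SM1"] # cold restart: SM1's archive defines the database
    lost = [txn for txn in committed if txn not in surviving]
    return committed, lost


for setting in (0, 1):
    print(setting, replay(setting))
```

With the option set to 0, T2 is committed and then lost. With the option set to 1, T2 is never committed, so the cold restart loses nothing. The cost is that commits stop as soon as only one SM is running.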
Switching to the Safe Commit Protocol
While the safe commit protocol helps avoid data inconsistencies, for example errors that manifest as the "null descriptor" failure, it does not correct inconsistencies already written to atoms in the archive.
Therefore, if a database was running with a non-safe commit mode, and a corruption was introduced into the archive, switching to safe commit does not resolve the issue.
To resolve the issue, you must first run NuoDB Archive’s nuoarchive check command, using the --repair option to locate and repair the data inconsistency issues, such as missing descriptors.
For more information on NuoDB Archive, see Command Line Tools.