Handling Failure Events

In normal operation and in most failure situations, NuoDB runs reliably and recovers from failure automatically. However, as with any distributed system, a resource in a NuoDB domain can fail in a way that NuoDB cannot fix automatically. As a domain administrator, you might need to recover from, for example, the loss of an admin server host or the loss of a data center, perhaps because of a power outage.

NuoDB provides tools to help you identify whether a failure has occurred. To recover from a failure, you need to know which domain resources are running and which are not running, not connected, or not reachable. With that information, you can determine the tasks you need to perform to resolve the failure. The tasks required vary according to which resource has been lost and whether there is Admin Process (AP) quorum.

As described in Admin Process Quorum, APs must be running and available on the majority of admin servers in the domain when you want to perform certain domain tasks, such as adding a database process or adding an admin server to the domain. These tasks update the durable domain configuration, which is stored consistently on each admin server in the domain by means of a Raft log.
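The majority rule can be illustrated with a short calculation. The following Python sketch is not part of NuoDB or its tooling. The function name has_quorum and its parameters are hypothetical, and the sketch assumes that "majority" means strictly more than half of the admin servers in the domain.

```python
def has_quorum(total_admin_servers: int, reachable_aps: int) -> bool:
    """Return True if APs are running and reachable on a strict majority
    of the admin servers in the domain.

    Illustrative only: this helper is hypothetical and not a NuoDB API.
    """
    if total_admin_servers <= 0:
        raise ValueError("total_admin_servers must be positive")
    if not 0 <= reachable_aps <= total_admin_servers:
        raise ValueError("reachable_aps must be between 0 and total_admin_servers")
    # A strict majority is more than half, so 2 of 4 is not enough.
    return reachable_aps > total_admin_servers // 2


if __name__ == "__main__":
    # Example domain sizes paired with the number of reachable APs.
    for total, reachable in [(3, 2), (3, 1), (4, 2), (5, 3)]:
        status = "quorum" if has_quorum(total, reachable) else "no quorum"
        print(f"{reachable} of {total} admin servers reachable: {status}")
```

Under this assumption, a domain with three admin servers keeps quorum after losing one host, whereas a domain with four admin servers split evenly across two data centers loses quorum if either data center becomes unreachable.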

See also: