Re-establishing Admin Process (AP) Quorum

As stated in Admin Process (AP) Quorum, a quorum of Admin Processes (APs) in the domain is needed to make any changes to the durable domain configuration. In the unlikely event that connectivity to so many hosts has been lost that quorum cannot be established, perform the following steps to re-establish AP quorum.

The commands required to re-establish AP quorum are issued using NuoDB Command (nuocmd). For more information on NuoDB Command and other command line tools, see Command Line Tools.

1. First confirm that action is required and obtain information on the servers that need to be removed from the domain. Run the show domain command to confirm that quorum has been lost and to identify the IDs of the disconnected admin servers. In the example below, servers r0db2, r0db3 and r0db4 are disconnected from r0db0 and r0db1, so quorum is not established and no leader is identified.

It is possible that the admin servers that are showing as disconnected are in fact still running and hold a majority partition of the domain. For example, this could happen if there was a network partition with two hosts (r0db0 and r0db1) on one side of the partition and three hosts (r0db2, r0db3 and r0db4) on the other. If show domain is run while connected to one of the second set of hosts (r0db2, r0db3 or r0db4), it would show three connected admin servers constituting a majority of the domain.

Confirm independently that the admin servers on the disconnected hosts cannot be restarted or reconnected and that action on the remaining hosts is needed to re-establish quorum. If the disconnected admin servers are still running and have a valid quorum, you should not re-establish an AP quorum with the minority admin servers.

nuocmd show domain
...
Servers:
  [r0db0] 172.31.45.7:48005 [last_ack = 9.32] [member = ADDED] [raft_state = ACTIVE] (FOLLOWER, Leader=<NO VALUE>, log=5/99/100) Connected *
  [r0db1] 172.31.44.101:48005 [last_ack = 9.32] [member = ADDED] [raft_state = ACTIVE] (FOLLOWER, Leader=<NO VALUE>, log=5/99/100) Connected
  [r0db2] 172.31.42.100:48005 [last_ack = 119.32] [member = ADDED] [raft_state = ACTIVE] (FOLLOWER, Leader=r0db4, log=5/98/98) Disconnected
  [r0db3] 172.31.47.31:48005 [last_ack = 59.32] [member = ADDED] [raft_state = ACTIVE] (FOLLOWER, Leader=r0db4, log=5/98/98) Disconnected
  [r0db4] 172.31.47.176:48005 [last_ack = 29.32] [member = ADDED] [raft_state = ACTIVE] (LEADER, Leader=r0db4, log=5/99/99) Disconnected

In this example the servers r0db2, r0db3 and r0db4 are disconnected. It is determined that r0db2 can probably be restarted later, but not soon enough to form a valid quorum quickly. The admin service cannot be started on r0db3 and r0db4, so they are to be removed from the domain.
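The check in step 1 can be sketched as a small shell helper. This is a minimal sketch, not part of NuoDB: the function name quorum_check is hypothetical, it assumes the show domain output has the layout shown in the example above, and it assumes quorum means a strict majority of the listed servers. It only reflects the view of the admin server you are connected to, so it cannot replace the independent confirmation described above.

```shell
# Hypothetical helper (not a NuoDB tool): reads `nuocmd show domain`
# output on stdin and summarizes how many servers are shown as Connected.
# Assumes each server line looks like the example in this section:
#   [server-id] host:port ... Connected|Disconnected [*]
quorum_check() {
  awk '
    # A server line starts with "[id]" followed by an "address:port" field.
    $1 ~ /^\[.+\]$/ && $2 ~ /^[0-9.]+:[0-9]+$/ {
      total++
      # The line for the server we are connected to ends with "*".
      status = ($NF == "*") ? $(NF - 1) : $NF
      if (status == "Connected") connected++
    }
    END {
      # Assumption: quorum requires a strict majority of listed servers.
      needed = int(total / 2) + 1
      verdict = (connected >= needed) ? "quorum-possible" : "quorum-lost"
      printf "total=%d connected=%d needed=%d %s\n", total, connected, needed, verdict
    }'
}
```

A possible invocation is `nuocmd show domain | quorum_check`.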

2. To re-establish an AP quorum, the disconnected servers are temporarily removed from the voting for leadership. This is done by restarting the surviving admin servers with the --evicted-servers option and a comma-separated list of servers to exclude from voting. This must be done on all of the surviving admin servers. In this example we restart the servers on r0db0 and r0db1 with the following command:

/opt/nuodb/etc/nuoadmin restart --evicted-servers r0db3,r0db4

This removes r0db3 and r0db4 from the voting for AP quorum but will not remove them from the durable domain.
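Because the same command has to be run on every surviving admin server, it can help to build the command line once from the list of server IDs. The following is a minimal sketch: restart_command is a hypothetical helper name, the nuoadmin path is the one used in the example above and may differ on your hosts, and the function only prints the command rather than running it.

```shell
# Path taken from the example in this section; adjust for your installation.
NUOADMIN=/opt/nuodb/etc/nuoadmin

# Hypothetical helper: prints the restart command for the given server IDs.
# With no arguments it prints a plain restart with no --evicted-servers option.
restart_command() {
  if [ "$#" -eq 0 ]; then
    printf '%s restart\n' "$NUOADMIN"
  else
    # Join the arguments with commas; IFS is changed only inside the
    # command substitution, so the caller's IFS is left untouched.
    ids=$( IFS=,; printf '%s' "$*" )
    printf '%s restart --evicted-servers %s\n' "$NUOADMIN" "$ids"
  fi
}
```

For the example domain, `restart_command r0db3 r0db4` prints the command shown above, which can then be reviewed and run on r0db0 and r0db1.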

3. Since there are now three voting members instead of the original five, the remaining admin servers form a majority of two out of three. This can be validated by the fact that a new leader has been elected and is one of the connected admin servers:

nuocmd show domain
...
Servers:
  [r0db0] 172.31.45.7:48005 [last_ack = 4.89] [member = ADDED] [raft_state = ACTIVE] (LEADER, Leader=r0db0, log=46/103/103) Connected *
  [r0db1] 172.31.44.101:48005 [last_ack = 4.61] [member = ADDED] [raft_state = ACTIVE] (FOLLOWER, Leader=r0db0, log=46/103/103) Connected
  [r0db2] 172.31.42.100:48005 [last_ack = NEVER] [member = ADDED] [raft_state = <NO VALUE>] (<NO VALUE>, Leader=<NO VALUE>, log=?/?/?) Disconnected
  [r0db3] 172.31.47.31:48005 [last_ack = NEVER] [member = ADDED] [raft_state = <NO VALUE>] (<NO VALUE>, Leader=<NO VALUE>, log=?/?/?) Disconnected
  [r0db4] 172.31.47.176:48005 [last_ack = NEVER] [member = ADDED] [raft_state = <NO VALUE>] (<NO VALUE>, Leader=<NO VALUE>, log=?/?/?) Disconnected
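
The arithmetic and the leader check in step 3 can be sketched in shell. This is a minimal sketch under stated assumptions: majority and connected_leader are hypothetical helper names, a majority is assumed to be a strict majority of the voting members, and the parsing assumes the show domain layout shown above.

```shell
# Smallest number of votes that is a strict majority of N voting members
# (assumption: quorum is a strict majority, as the example suggests).
majority() {
  echo $(( $1 / 2 + 1 ))
}

# Hypothetical helper: reads `nuocmd show domain` output on stdin and
# prints the ID of each server that is shown as Connected and reports
# itself as LEADER. Exits non-zero if no such server is found.
connected_leader() {
  awk '
    $1 ~ /^\[.+\]$/ {
      # The line for the server we are connected to ends with "*".
      status = ($NF == "*") ? $(NF - 1) : $NF
      if (status == "Connected" && index($0, "(LEADER,") > 0) {
        # Strip the surrounding brackets from "[id]".
        print substr($1, 2, length($1) - 2)
        found = 1
      }
    }
    END { exit (found ? 0 : 1) }'
}
```

With five voting members `majority 5` gives 3, so two connected servers are not enough. With three voting members `majority 3` gives 2, which r0db0 and r0db1 satisfy. A possible leader check is `nuocmd show domain | connected_leader`.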

4. The disconnected servers can now be permanently removed from the durable domain using the delete server command.

nuocmd delete server --server-id r0db4
nuocmd show domain
...
Servers:
  [r0db0] 172.31.45.7:48005 [last_ack = 5.35] [member = ADDED] [raft_state = ACTIVE] (LEADER, Leader=r0db0, log=46/106/106) Connected *
  [r0db1] 172.31.44.101:48005 [last_ack = 5.35] [member = ADDED] [raft_state = ACTIVE] (FOLLOWER, Leader=r0db0, log=46/106/106) Connected
  [r0db2] 172.31.42.100:48005 [last_ack = NEVER] [member = ADDED] [raft_state = <NO VALUE>] (<NO VALUE>, Leader=<NO VALUE>, log=?/?/?) Disconnected
  [r0db3] 172.31.47.31:48005 [last_ack = NEVER] [member = ADDED] [raft_state = <NO VALUE>] (<NO VALUE>, Leader=<NO VALUE>, log=?/?/?) Disconnected
nuocmd delete server --server-id r0db3
nuocmd show domain
...
Servers:
  [r0db0] 172.31.45.7:48005 [last_ack = 9.55] [member = ADDED] [raft_state = ACTIVE] (LEADER, Leader=r0db0, log=46/109/109) Connected *
  [r0db1] 172.31.44.101:48005 [last_ack = 9.55] [member = ADDED] [raft_state = ACTIVE] (FOLLOWER, Leader=r0db0, log=46/106/106) Connected
  [r0db2] 172.31.42.100:48005 [last_ack = NEVER] [member = ADDED] [raft_state = <NO VALUE>] (<NO VALUE>, Leader=<NO VALUE>, log=?/?/?) Disconnected
Servers that are deleted may not be able to re-enter the domain with the existing server IDs.
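
After each delete server command, the show domain output can be checked to confirm that the server ID is no longer listed. The following is a minimal sketch: server_listed is a hypothetical helper name, and it assumes each server line begins with the ID in square brackets, as in the output above.

```shell
# Hypothetical helper: reads `nuocmd show domain` output on stdin and
# succeeds (exit 0) if the given server ID is still listed, or fails
# (exit 1) if it is not.
server_listed() {
  awk -v id="[$1]" '
    $1 == id { found = 1 }
    END { exit (found ? 0 : 1) }'
}
```

For example, `nuocmd show domain | server_listed r0db4 || echo "r0db4 has been removed"` prints the message only once r0db4 no longer appears in the list.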

5. Restart the admin servers without the --evicted-servers option.

It is important to restart the admin servers immediately. Failure to do so could cause issues with quorum voting in the domain in the future.

If the issue that caused you to delete a server ID from the domain is resolved, you may wish to start the admin server on the host in question again.