Recovering from a Lost Majority

As discussed in About Admin Process Quorum, a quorum of the admin processes in the domain is needed to make any changes to the durable domain configuration. In the unlikely event that connectivity to enough hosts has been lost and admin process quorum cannot be established, do the following:

Note: The show domain and delete server commands are issued using NuoDB Command (nuocmd). For more information on NuoDB Command and other command line tools, see Command Line Tools.

This section provides an example of the commands required to re-establish admin process quorum.

1. Run the show domain command to confirm that quorum has been lost.

nuocmd show domain
server version: 4.0.rel40dev-21-546a5cbef7, server license: Enterprise
server time: 2019-05-02T17:09:03.615, client token: f29d9fee6b4259c3e8c0c715e15e81b447ba6811
Servers:
  [r0db0] 172.31.45.7:48005 (FOLLOWER, Leader=<NO VALUE>) Connected *
  [r0db1] 172.31.44.101:48005 (FOLLOWER, Leader=<NO VALUE>) Connected
  [r0db2] 172.31.42.100:48005 (FOLLOWER, Leader=r0db4) Disconnected
  [r0db3] 172.31.47.31:48005 (FOLLOWER, Leader=r0db4) Disconnected
  [r0db4] 172.31.47.176:48005 (LEADER, Leader=r0db4) Disconnected
Databases:  
  dbt2 [state = AWAITING_ARCHIVE_HISTORIES_INC]
    [SM] 6a03ac17-e8e1-48e1-bbd9-b4d195d7081d-r0db1/172.31.44.101:48006 [start_id = 5] [server_id = r0db1] MONITORED:RUNNING
    [TE] 6a03ac17-e8e1-48e1-bbd9-b4d195d7081d-r0db2/172.31.42.100:48006 [start_id = 8] [server_id = r0db2] MONITORED:UNREACHABLE(RUNNING)
    [TE] 6a03ac17-e8e1-48e1-bbd9-b4d195d7081d-r0db4/172.31.47.176:48006 [start_id = 9] [server_id = r0db4] MONITORED:UNREACHABLE(RUNNING)

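Admin processes r0db2, r0db3, and r0db4 are disconnected from r0db0 and r0db1. With only two of the five admin processes reachable, the connected processes fall short of a majority, so quorum is not established and no leader is identified.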

Note: Confirm independently that the admin servers on these hosts cannot be restarted and that action on the remaining hosts is needed to re-establish quorum.
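
The check in step 1 can also be scripted. The following is a minimal sketch, not part of NuoDB: the helper name count_servers is hypothetical, and it assumes the output layout matches the sample above and that quorum means a strict majority of the listed servers. It reads show domain output on standard input and prints the number of connected servers, the total, and the number needed for a majority.

```shell
#!/bin/sh
# Hypothetical helper (not a NuoDB tool): summarize the "Servers:" section
# of `nuocmd show domain` output read from stdin.
# Assumption: the line layout matches the sample output in this section.
# Prints "<connected> <total> <needed>", where needed = floor(total/2) + 1.
count_servers() {
    awk '
        /^Servers:/   { in_servers = 1; next }
        /^Databases:/ { in_servers = 0 }
        in_servers && /^[[:space:]]*\[/ {
            total++
            line = $0
            # Drop the trailing "*" marker (the server answering the request)
            sub(/[[:space:]]*\*?[[:space:]]*$/, "", line)
            # "Disconnected" does not match: the pattern needs a capital C
            if (line ~ / Connected$/) connected++
        }
        END { printf "%d %d %d\n", connected, total, int(total / 2) + 1 }
    '
}

# Usage (requires a reachable admin process):
#   nuocmd show domain | count_servers
```

For the sample output in step 1 this prints 2 5 3: two connected servers out of five, with three needed for a majority.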

2. Restart NuoDB Admin with only the surviving admin processes, using the --evicted-servers option to exclude admin processes from quorum voting.

service nuoadmin restart --evicted-servers <r0db3,r0db4>

The --evicted-servers option does not remove admin processes from the domain; it only excludes them from quorum voting so that the remaining processes can re-establish quorum. Running the command above excludes r0db3 and r0db4 from voting, which leaves three voting members (r0db0, r0db1, and r0db2). The two connected processes, r0db0 and r0db1, then form a quorum of two of three.
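
The arithmetic behind the eviction can be sketched as follows. This is an illustration only: quorum_after_eviction is a hypothetical helper, and it assumes that quorum is a strict majority of the voting members, which is consistent with the "two of three" figure in this example.

```shell
#!/bin/sh
# Hypothetical helper: quorum_after_eviction TOTAL EVICTED CONNECTED
# Assumption: quorum = strict majority of the voting (non-evicted) servers.
# Prints "<voters> <needed> yes|no".
quorum_after_eviction() {
    total=$1 evicted=$2 connected=$3
    voters=$(( total - evicted ))      # servers still allowed to vote
    needed=$(( voters / 2 + 1 ))       # smallest strict majority
    if [ "$connected" -ge "$needed" ]; then ok=yes; else ok=no; fi
    echo "$voters $needed $ok"
}

quorum_after_eviction 5 0 2   # no eviction
quorum_after_eviction 5 2 2   # r0db3 and r0db4 evicted
```

The first call prints 5 3 no and the second prints 3 2 yes, which matches the situation before and after the restart in step 2.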

3. Run the show domain command again to confirm that admin process quorum has been established and a leader has been elected.

nuocmd show domain
server version: 4.0.rel40dev-21-546a5cbef7, server license: Enterprise
server time: 2019-05-02T17:14:29.888, client token: a03839797ddb6297fe6d6eb96afee815b2faa302
Servers:
  [r0db0] 172.31.45.7:48005 (LEADER, Leader=r0db0, log=46/103/103) Connected *
  [r0db1] 172.31.44.101:48005 (FOLLOWER, Leader=r0db0, log=46/103/103) Connected
  [r0db2] 172.31.42.100:48005 (<NO VALUE>, Leader=<NO VALUE>, log=?/?/?) Disconnected
  [r0db3] 172.31.47.31:48005  (<NO VALUE>, Leader=<NO VALUE>, log=?/?/?) Disconnected
  [r0db4] 172.31.47.176:48005 (<NO VALUE>, Leader=<NO VALUE>, log=?/?/?) Disconnected
Databases:  
  dbt2 [state = REQUESTED_TE_SHUTDOWN]
    [TE] 6a03ac17-e8e1-48e1-bbd9-b4d195d7081d-r0db2/172.31.42.100:48006 [start_id = 8] [server_id = r0db2] MONITORED:UNREACHABLE(RUNNING)
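
The confirmation in step 3 can be automated with a small filter. This is a sketch under the assumption that the output layout matches the samples in this section; has_connected_leader is a hypothetical helper, not a NuoDB command. It succeeds only when a server line shows the LEADER role and is also Connected, so the step 1 output (where the old leader r0db4 is Disconnected) does not pass.

```shell
#!/bin/sh
# Hypothetical helper: reads `nuocmd show domain` output on stdin and exits 0
# if some server line has the LEADER role and is Connected, 1 otherwise.
# Assumption: roles and connection state are printed as in the samples above.
has_connected_leader() {
    awk '/\(LEADER,/ && / Connected/ { found = 1 } END { exit !found }'
}

# Usage (requires a reachable admin process):
#   if nuocmd show domain | has_connected_leader; then
#       echo "quorum re-established"
#   fi
```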

4. Permanently remove disconnected servers from the durable domain using the delete server command.

Note: Deleted servers may not be permitted to re-enter the domain at a later time (with the existing server IDs).

nuocmd delete server --server-id r0db4
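
When several servers must be removed, it can help to generate the commands first and review them before running anything, because deletion is permanent. The sketch below only prints commands; print_delete_commands is a hypothetical helper, and the server IDs passed to it are illustrative and must be chosen by the operator.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print one delete command per server ID given.
# Nothing is executed; review the output before running it.
print_delete_commands() {
    for id in "$@"; do
        printf 'nuocmd delete server --server-id %s\n' "$id"
    done
}

# Illustrative IDs only
print_delete_commands r0db3 r0db4
```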

 

Reprovisioning Evicted Servers

To bring back an admin server on a host that was removed from the domain (using the --evicted-servers option), re-peer it with the domain. First, remove the raftlog file:

cd /var/opt/nuodb

rm raftlog

In the file /etc/nuodb/nuoadmin.conf, find the server ID, identified by ThisServerId. If that server ID appears in the initialMembership list (that is, the server was part of the initial membership of the domain), it must be changed before the admin server is started and attempts to re-peer with the domain. In this example, the server ID for r0db3 is changed to r0db3_repro and the admin server is then started on the host. The new domain membership can be seen here:

nuocmd show domain
server version: 4.0.rel40dev-29-21929d6e8b, server license: Enterprise
server time: 2019-06-26T21:26:57.575, client token: 1a731b10a72c9e261467c96401afa8edf896c614
Servers:
  [r0db0] 172.31.43.237:48005 (FOLLOWER, Leader=r0db1, log=95/143/143) Connected
  [r0db1] 172.31.46.156:48005 (LEADER, Leader=r0db1, log=95/143/143) Connected
  [r0db2] 172.31.34.187:48005 (FOLLOWER, Leader=r0db1, log=95/143/143) Connected
  [r0db3_repro] 172.31.35.152:48005 (FOLLOWER, Leader=r0db1, log=95/143/143) Connected *
Databases:
  dbt2 [state = RUNNING]
    [SM] 2938b903-d708-40c7-9cca-82bcc9859375-r0db1/172.31.46.156:48006 [start_id = 5] [server_id = r0db1] MONITORED:RUNNING
    [SM] 2938b903-d708-40c7-9cca-82bcc9859375-r0db0/172.31.43.237:48006 [start_id = 6] [server_id = r0db0] MONITORED:RUNNING
    [TE] 2938b903-d708-40c7-9cca-82bcc9859375-r0db2/172.31.34.187:48006 [start_id = 7] [server_id = r0db2] MONITORED:RUNNING
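
The server ID change described above can be sketched as a small script. This is an illustration under a loudly stated assumption: it expects ThisServerId to appear in nuoadmin.conf as a quoted key and value on one line, for example "ThisServerId": "r0db3". Check the actual file before using anything like this. The helper name rename_server_id is hypothetical, and the new ID must not contain / or & because it is substituted into a sed expression.

```shell
#!/bin/sh
# Hypothetical helper: rename_server_id CONF_FILE NEW_ID
# Assumption: the file contains a line of the form  "ThisServerId": "<id>"
# Keeps a backup of the original file as CONF_FILE.bak.
rename_server_id() {
    conf=$1 new_id=$2
    cp "$conf" "$conf.bak"
    # Replace only the quoted value that follows the ThisServerId key
    sed "s/\(\"ThisServerId\"[[:space:]]*:[[:space:]]*\"\)[^\"]*\"/\1${new_id}\"/" \
        "$conf.bak" > "$conf"
}

# Usage on the evicted host, before starting the admin server:
#   rename_server_id /etc/nuodb/nuoadmin.conf r0db3_repro
```

Keeping the .bak copy makes it possible to compare the two files and confirm that only the ThisServerId value changed.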