Loss of a Storage Manager Host

Symptom

A host that was running a storage manager (SM) can disappear, for example because of a power outage. A cold start of the database that was served by the SM on the lost host then fails with a message such as Failed to get initial node... archive in use.

Cause

NuoDB supports durable last-known configuration of database processes (TEs and SMs) to ensure safety under failure. The durable domain configuration provides domain configuration information that is stored consistently on each broker in the domain by means of a Raft log. This means that if a host is lost, the durable domain configuration might still list a database process without any indication that the database shut down properly. The rest of the domain assumes that the database process is still running, and a database cold restart is prevented. Consider the following:

  1. A database runs multiple processes on three hosts.
  2. One of the hosts runs a redundant SM process and that host is killed or partitioned.
  3. The rest of the domain never receives a signal that the SM has shut down. The SM is still registered in the Domain State Machine (DSM), which is part of the durable domain configuration.
  4. Attempting a cold restart of the database might fail during the archive history check, which visits all known archives of a database to determine which SM to start first. The following output shows this situation:

    2015-01-15T13:50:48.675-0500 WARNING com.nuodb.agent.service.
       DescriptionService$DescriptionEnforcer.enforce (asynch-dao4-thread-8) 
       Failed to enforce database [test]: archive history can not visit 
       {uuid:6e2a2d4c-f26e-43f0-9a00-73c2c0ee60f0=SM location: RequirementsName=[SMs], 
       Region=[us-east-1], Archive=[/var/opt/nuodb/production-archives/one], 
       Journal-dir=[/var/opt/nuodb/production-archives/one_journal]}
        at com.nuodb.impl.util.Preconditions.ioCheckArgument(Preconditions.java:32)
        at com.nuodb.agent.description.Enforcer.buildPeerHistories(Enforcer.java:252)
        at com.nuodb.agent.description.Enforcer.enforceColdSM(Enforcer.java:207)
  5. After you remove the unreachable SM archive location from the database, the archive history check finishes:

    nuodb [test] > remove database archiveLocation dbname test host 
       6e2a2d4c-f26e-43f0-9a00-73c2c0ee60f0 path /var/opt/nuodb/production-archives/test 
    
    nuodb [test] > show database config database test 
    
    Database: test, INACTIVE, Status=STOPPED, template [Multi Host] 
      Variables: {SM_MAX=2, TE_MIN=1, REGION=us-east-1, SM_MIN=1} 
      Options: {commit=remote:1, mem=2g}    
      Default Options: { "commit": "${COMMIT:remote:1}","backoff.reqMinUptime":"30000","hostLimit":"${HOST_LIMIT:false}"}
      Archive Locations: 
        ip-172-31-44-94/54.165.247.81:48004, requirements: SMs, region: us-east-1: 
          archive: /var/opt/nuodb/production-archives/one 
          journal-dir: /var/opt/nuodb/production-archives/one_journal 
      Multi Host UNMET 
    ...
    
  6. But a cold restart of the database still fails. You would see a NodeFailed alarm, if you defined one, with the message Unable to find entryNode for non-initial node.

    2015-01-15T13:51:06.766-0500 [3678] Starting Storage Manager 2.1.rel21-68-d3b189a: 
       database=test, protocol=1.102.d3b189a
    2015-01-15T13:51:07.907-0500 INFO com.nuodb.agent.net.NetworkContainerImpl.acceptMessage 
       (serv-socket5-thread-17) listener failed to handle message
    java.lang.IllegalArgumentException: Unable to find entryNode for non-initial node
        at com.nuodb.impl.util.Preconditions.checkArgument(Preconditions.java:9)
        at com.nuodb.agent.db.DatabaseContainerImpl$DatabaseImpl.introduceChorusMember
          (DatabaseContainerImpl.java:459)
        at com.nuodb.agent.service.HostService.handleChorusEntry(HostService.java:425)
  7. The issue is that when the SM host was suddenly taken offline, the SM's entry was not removed from the durable domain configuration. When the new SM started, the broker told it "you are not the initial database process" (as decided by the DSM), but no "live" database process could be found for it to join. The new SM process hangs indefinitely while holding the archive lock. Any subsequent attempt fails with:

    java.io.IOException: Failed to parse response: Show archive history failed: Archive 
          is already in use: /var/opt/nuodb/production-archives/test
        at com.nuodb.agent.service.ProcessService.parseXmlOutput(ProcessService.java:651)
        at com.nuodb.agent.service.ProcessService.parseXmlOutputForPeerResponse
          (ProcessService.java:658)
        at com.nuodb.agent.service.ProcessService.startProcess(ProcessService.java:449)
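
The three failure messages in the sequence above can be searched for in an agent log to confirm that you are in this situation. The following is a minimal Python sketch, not a NuoDB tool. The message fragments and the "[PID] Starting Storage Manager" pattern are taken from the excerpts above; the function name scan_log and the wording of the explanations are assumptions made for illustration.

```python
import re

# Message fragments quoted from the log excerpts above, paired with this
# sketch's own (non-official) summary of what each one indicates.
SIGNATURES = [
    ("archive history can not visit",
     "archive history check failed: an unreachable archive is still registered"),
    ("Unable to find entryNode for non-initial node",
     "new SM was told it is not the initial process, but no live process was found"),
    ("Archive is already in use",
     "a hung SM process still holds the archive lock"),
]

# Matches entries such as "... [3678] Starting Storage Manager ..."
SM_START = re.compile(r"\[(\d+)\] Starting Storage Manager")


def scan_log(text):
    """Return (findings, sm_pids) for the given log text."""
    # Collapse whitespace so that messages wrapped across lines still match.
    flat = re.sub(r"\s+", " ", text)
    findings = [meaning for fragment, meaning in SIGNATURES if fragment in flat]
    sm_pids = [int(pid) for pid in SM_START.findall(flat)]
    return findings, sm_pids
```

Calling scan_log on the contents of an agent log returns the matching explanations in the order listed above, together with the PIDs of any SM processes that the log shows being started. Those PIDs are candidates for the hung process that holds the archive lock.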

Solution

Remove the process from the durable domain configuration:

  1. Stop the database by using the NuoDB Manager command shutdown database database_name in order to stop any enforcement, and shut down any processes on the remaining hosts. In the above scenario, there are no other processes.
  2. Execute the NuoDB Manager domainstate list command to obtain the process ID of the SM.
  3. Run the following command to remove the SM from the durable domain configuration:

    domainstate removeProcess id stableId
  4. Manually kill the hung SM process that holds the lock on the archive, PID 3678 in the above scenario.
  5. Start the database.
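
The steps above can be written out as an ordered checklist for a specific incident. The following Python sketch only formats the commands named in this section; it does not run anything. The function name recovery_steps and its parameters are hypothetical, and the use of the operating system's kill command for step 4 assumes a Unix-like host. The command for starting the database is not given here, so the sketch leaves that step as a note.

```python
def recovery_steps(database_name, stable_id, hung_pid):
    """Return the recovery steps as (where to run it, what to run) pairs.

    stable_id is the ID reported by "domainstate list" for the lost SM;
    hung_pid is the OS process ID of the SM that holds the archive lock.
    """
    return [
        ("NuoDB Manager", f"shutdown database {database_name}"),
        ("NuoDB Manager", "domainstate list"),
        ("NuoDB Manager", f"domainstate removeProcess id {stable_id}"),
        # Assumption: a Unix-like host where the hung SM can be ended with kill.
        ("host shell", f"kill {hung_pid}"),
        ("note", f"start database {database_name} as you normally would"),
    ]


if __name__ == "__main__":
    # "42" is a placeholder; use the ID shown by "domainstate list".
    for number, (where, what) in enumerate(recovery_steps("test", "42", 3678), 1):
        print(f"{number}. [{where}] {what}")
```

Keeping the steps in one ordered list makes it harder to skip the removal from the durable domain configuration, which is the step that actually unblocks the cold restart.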