Automatic Management of NuoDB State

Introduction

In Kubernetes-based deployments, the creation of NuoDB objects in the NuoDB domain state is automated by nuodocker, which is the entrypoint script for Admin, TE, and SM containers. For example, nuodocker start sm is the entrypoint for an SM container, which creates an archive object and a database object (if they do not already exist) before starting an SM process. This is in contrast with bare-metal deployments, where users manage NuoDB domain state by interacting with NuoDB directly, using the nuocmd command-line tool.

In Kubernetes-based deployments, NuoDB objects like NuoDB Admin processes, archives, databases, and database processes are created automatically by nuodocker, while users typically interact only with Kubernetes. To allow users to manage NuoDB clusters indirectly via Kubernetes, a mapping exists between Kubernetes objects and NuoDB objects:

  • Pods controlled by the Admin StatefulSet (identified by Pod name) are associated with Admin servers (identified by server ID).

  • Pods controlled by the database StatefulSet (identified by Pod name) are associated with SM processes (identified by start ID).

  • Pods controlled by the database Deployment (identified by Pod name) are associated with TE processes (identified by start ID).

  • Persistent Volume Claims (PVCs) bound to SM Pods (identified by claim name) are associated with archives (identified by archive ID).

  • Database StatefulSet and Deployment objects for SMs and TEs, respectively, are associated with a database.

To enable observability of NuoDB state via these Kubernetes objects, readiness probes are defined for NuoDB Pods that invoke nuocmd check subcommands. The section below describes how NuoDB objects are automatically updated or deleted in response to Kubernetes events.
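
As a hedged illustration, an Admin Pod's readiness probe can gate readiness on a nuocmd check subcommand; the flags and timings below are illustrative rather than taken verbatim from the NuoDB Helm Charts:

```yaml
readinessProbe:
  exec:
    # Report the Pod as ready only once the local Admin has caught up with
    # the Raft leader (flags and timings here are illustrative).
    command: ["nuocmd", "check", "server", "--check-converged", "--timeout", "30"]
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 4
```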

Kubernetes-Aware Admin (KAA)

The NuoDB Admin layer has Operator-like capabilities that allow it to keep Kubernetes objects in sync with the corresponding NuoDB objects throughout all stages of the object lifecycle, including deletion. This includes the following use cases:

  1. When the Admin StatefulSet is scaled down, the Admin processes that will no longer be scheduled are excluded from consensus, but not removed from the Raft membership, so that they can be re-included when the Admin StatefulSet is scaled back up.

  2. When the database StatefulSet (SMs) is scaled down, the archive IDs associated with the PVCs that will no longer be scheduled are removed, but not purged, so that they can be reused when the database StatefulSet is scaled back up.

  3. When the PVC associated with an archive is deleted, the archive ID is removed and purged, because the archive can no longer be bound by the StatefulSet to any scheduled SM Pod.

  4. When the container associated with a particular process start ID exits, either because the Pod was deleted or the container exited and was subsequently replaced by a new one within the same Pod, the process is removed from the domain state.

The functionality above is supported by the Kubernetes-Aware Admin (KAA) module, which is enabled by default in Kubernetes-based deployments that follow the conventions described below.

Kubernetes Conventions

To enable the resync use cases above, a set of conventions is defined that Kubernetes implementations must follow so that the Admin can unambiguously map Kubernetes objects to NuoDB objects and vice versa. These conventions are already followed by the NuoDB Helm Charts (ignoring configurations that use DaemonSets for SMs) and are listed below:

  • Admins and SMs are defined by StatefulSets, and TEs are defined by Deployments.

  • The Admin StatefulSet creates Pods with names that are identical to the server IDs.

  • The Admin StatefulSet defines a volumeClaimTemplate with name raftlog, which results in Admin Pods having the volume raftlog used to store Raft data. For example:

    apiVersion: "apps/v1"
    kind: StatefulSet
    metadata:
      annotations:
        description: |-
          NuoAdmin statefulset resource for NuoDB Admin layer.
       ...
      name: {{ template "admin.fullname" . }}
    spec:
      ...
      volumeClaimTemplates:
      - metadata:
          # PVC for Raft data must be named "raftlog"
          name: raftlog
          labels:
            ...
  • Database StatefulSets and Deployments contain the database name in the label database.

  • Database StatefulSets each define a volumeClaimTemplate with name archive-volume, which results in SM Pods having the volume archive-volume used to store archive data. For example:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      labels:
        database: {{ .Values.database.name }}
        ...
      name: sm-{{ template "database.fullname" . }}
    spec:
      ...
      volumeClaimTemplates:
      - metadata:
          # PVC for archive must be named "archive-volume"
          name: archive-volume
          labels:
            database: {{ .Values.database.name }}
            ...
  • A Kubernetes service account and role-binding are defined that give access to various Kubernetes resources, including Pods, StatefulSets, and PVCs (verbs get, list, and watch). The service account credentials must be made available to the Admin and database process Pods. The service account also has privileges to inspect, create, and update Lease objects (verbs get, create, and update), which are used for coordination among Admins when performing resync.
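
A Role granting these privileges might look like the following sketch (the role name is illustrative, and the inclusion of Deployments is an assumption based on TEs being defined by Deployments):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nuodb-kube-inspector  # illustrative name
rules:
# Read-only access used to map Kubernetes objects to NuoDB objects
- apiGroups: [""]
  resources: ["pods", "persistentvolumeclaims"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets", "deployments"]
  verbs: ["get", "list", "watch"]
# Leases are used to coordinate resync actions among Admins
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "create", "update"]
```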

Resync Using Kubernetes State - A Working Example

With explicit mappings between Kubernetes objects and NuoDB objects, it is possible for the Admin to inspect Kubernetes state and to execute Raft commands to cause NuoDB domain state to converge with it, as described above. Each Admin process receives a stream of Kubernetes state-change events which begins with an image of the state at the time that the Admin begins listening.

For the examples below, assume the following minimally-redundant NuoDB deployment consisting of 2 Admins, 2 SMs, and 2 TEs.

$ kubectl get pod
NAME                                                  READY   STATUS    RESTARTS   AGE
dom-nuodb-cluster0-admin-0                            1/1     Running   0          42m
dom-nuodb-cluster0-admin-1                            1/1     Running   0          10m
sm-db-nuodb-cluster0-demo-database-0                  1/1     Running   0          84s
sm-db-nuodb-cluster0-demo-database-1                  1/1     Running   0          5m44s
te-db-nuodb-cluster0-demo-database-6d9c946569-ghvgf   1/1     Running   0          4m27s
te-db-nuodb-cluster0-demo-database-6d9c946569-tv2rb   1/1     Running   0          3m3s

$ kubectl exec dom-nuodb-cluster0-admin-0 -- nuocmd show domain
server version: 4.1.vee-2-644d1d6206, server license: Enterprise
server time: 2020-08-05T14:52:45.414, client token: a0405af0f77144187d3ded054295abd60bba9bc1
Servers:
  [dom-nuodb-cluster0-admin-0] dom-nuodb-cluster0-admin-0.nuodb.default.svc.cluster.local:48005 [last_ack = 1.64] [member = ADDED] [raft_state = ACTIVE] (LEADER, Leader=dom-nuodb-cluster0-admin-0, log=0/182/182) Connected *
  [dom-nuodb-cluster0-admin-1] dom-nuodb-cluster0-admin-1.nuodb.default.svc.cluster.local:48005 [last_ack = 1.63] [member = ADDED] [raft_state = ACTIVE] (FOLLOWER, Leader=dom-nuodb-cluster0-admin-0, log=0/182/182) Connected
Databases:
  demo [state = RUNNING]
    [TE] te-db-nuodb-cluster0-demo-database-6d9c946569-tv2rb/172.17.0.11:48006 [start_id = 12] [server_id = dom-nuodb-cluster0-admin-0] [pid = 39] [node_id = 3] [last_ack =  5.00] MONITORED:RUNNING
    [TE] te-db-nuodb-cluster0-demo-database-6d9c946569-ghvgf/172.17.0.7:48006 [start_id = 13] [server_id = dom-nuodb-cluster0-admin-1] [pid = 39] [node_id = 2] [last_ack =  6.00] MONITORED:RUNNING
    [SM] sm-db-nuodb-cluster0-demo-database-1/172.17.0.9:48006 [start_id = 15] [server_id = dom-nuodb-cluster0-admin-0] [pid = 59] [node_id = 4] [last_ack =  5.00] MONITORED:RUNNING
    [SM] sm-db-nuodb-cluster0-demo-database-0/172.17.0.8:48006 [start_id = 16] [server_id = dom-nuodb-cluster0-admin-1] [pid = 59] [node_id = 5] [last_ack =  2.90] MONITORED:RUNNING

Use Case 1: Admin Scale-down

To support Admin scale-down, all Admin processes need to have access to the Admin StatefulSet (assuming that there is only one Admin StatefulSet for a namespace), and each should exclude from consensus all server IDs in the Raft membership with ordinals greater than or equal to the current replica count for the Admin StatefulSet. For example, if the Admin StatefulSet has replicas: 3 and the Raft membership consists of server IDs admin-0, admin-1, admin-2, admin-3, admin-4, then admin-3 and admin-4 should be excluded from consensus. This is done automatically by the Admin processes by overriding the membership in the Raft membership state machine.
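
The exclusion rule can be sketched as a small function. This is a hypothetical helper, not the actual Admin implementation; it assumes server IDs follow the <statefulset-name>-<ordinal> Pod naming convention:

```python
def excluded_servers(raft_membership, statefulset_name, replicas):
    """Return the server IDs that should be excluded from consensus.

    Any server whose ordinal is >= the StatefulSet replica count will
    never be scheduled, so it is excluded (but kept in the membership).
    """
    prefix = statefulset_name + "-"
    return [
        server_id
        for server_id in raft_membership
        if server_id.startswith(prefix)
        and server_id[len(prefix):].isdigit()
        and int(server_id[len(prefix):]) >= replicas
    ]

# With replicas: 3, admin-3 and admin-4 are excluded from consensus.
members = ["admin-0", "admin-1", "admin-2", "admin-3", "admin-4"]
print(excluded_servers(members, "admin", 3))  # ['admin-3', 'admin-4']
```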

Consider the deployment environment above, which has replicas: 2 for the Admin StatefulSet dom-nuodb-cluster0-admin. If we invoke kubectl scale --replicas=1 on dom-nuodb-cluster0-admin, then the Admin with server ID dom-nuodb-cluster0-admin-1 will no longer be scheduled by Kubernetes. Since a majority of 2 is 2, dom-nuodb-cluster0-admin-0 automatically excludes dom-nuodb-cluster0-admin-1 from consensus in order to allow configuration changes to be made in its absence.

$ kubectl scale --replicas=1 statefulset dom-nuodb-cluster0-admin
statefulset.apps/dom-nuodb-cluster0-admin scaled

$ kubectl exec dom-nuodb-cluster0-admin-0 -- nuocmd show domain
server version: 4.1.vee-2-644d1d6206, server license: Enterprise
server time: 2020-08-05T15:15:03.729, client token: 38590c7b6cbfaab83be4c1ef2c57eb0d4ce977bd
Servers:
  [dom-nuodb-cluster0-admin-0] dom-nuodb-cluster0-admin-0.nuodb.default.svc.cluster.local:48005 [last_ack = 1.35] [member = ADDED] [raft_state = ACTIVE] (LEADER, Leader=dom-nuodb-cluster0-admin-0, log=0/182/182) Connected *
  [dom-nuodb-cluster0-admin-1] dom-nuodb-cluster0-admin-1.nuodb.default.svc.cluster.local:48005 [last_ack = 7.36] [member = ADDED] [raft_state = ACTIVE] (FOLLOWER, Leader=dom-nuodb-cluster0-admin-0, log=0/182/182) Evicted
Databases:
  demo [state = RUNNING]
    [TE] te-db-nuodb-cluster0-demo-database-6d9c946569-tv2rb/172.17.0.11:48006 [start_id = 12] [server_id = dom-nuodb-cluster0-admin-0] [pid = 39] [node_id = 3] [last_ack =  3.15] MONITORED:RUNNING
    [TE] te-db-nuodb-cluster0-demo-database-6d9c946569-ghvgf/172.17.0.7:48006 [start_id = 13] [server_id = dom-nuodb-cluster0-admin-1] [pid = 39] [node_id = 2] [last_ack = 14.17] MONITORED:RUNNING
    [SM] sm-db-nuodb-cluster0-demo-database-1/172.17.0.9:48006 [start_id = 15] [server_id = dom-nuodb-cluster0-admin-0] [pid = 59] [node_id = 4] [last_ack =  3.16] MONITORED:RUNNING
    [SM] sm-db-nuodb-cluster0-demo-database-0/172.17.0.8:48006 [start_id = 16] [server_id = dom-nuodb-cluster0-admin-1] [pid = 59] [node_id = 5] [last_ack = 10.99] MONITORED:RUNNING

Note that dom-nuodb-cluster0-admin-1 still appears in nuocmd show domain output, as do the database processes connected to it (start IDs 13 and 16), but it is shown as Evicted to signal that it is not participating in Raft consensus. To permanently remove the scaled-down Admin from the membership, the user can manually delete its PVC, as follows:

$ kubectl delete pvc raftlog-dom-nuodb-cluster0-admin-1
persistentvolumeclaim "raftlog-dom-nuodb-cluster0-admin-1" deleted

$ kubectl exec dom-nuodb-cluster0-admin-0 -- nuocmd show domain
server version: 4.1.vee-2-644d1d6206, server license: Enterprise
server time: 2020-08-05T15:19:29.365, client token: 497c19844e489c6307a2dd315bc43d3475f02191
Servers:
  [dom-nuodb-cluster0-admin-0] dom-nuodb-cluster0-admin-0.nuodb.default.svc.cluster.local:48005 [last_ack = 0.88] [member = ADDED] [raft_state = ACTIVE] (LEADER, Leader=dom-nuodb-cluster0-admin-0, log=0/183/183) Connected *
Databases:
  demo [state = RUNNING]
    [TE] te-db-nuodb-cluster0-demo-database-6d9c946569-tv2rb/172.17.0.11:48006 [start_id = 12] [server_id = dom-nuodb-cluster0-admin-0] [pid = 39] [node_id = 3] [last_ack =  8.76] MONITORED:RUNNING
    [TE] te-db-nuodb-cluster0-demo-database-6d9c946569-ghvgf/172.17.0.7:48006 [start_id = 13] [server_id = dom-nuodb-cluster0-admin-1] [pid = 39] [node_id = 2] [last_ack = >60] MONITORED:UNREACHABLE(RUNNING)
    [SM] sm-db-nuodb-cluster0-demo-database-1/172.17.0.9:48006 [start_id = 15] [server_id = dom-nuodb-cluster0-admin-0] [pid = 59] [node_id = 4] [last_ack =  8.76] MONITORED:RUNNING
    [SM] sm-db-nuodb-cluster0-demo-database-0/172.17.0.8:48006 [start_id = 16] [server_id = dom-nuodb-cluster0-admin-1] [pid = 59] [node_id = 5] [last_ack = >60] MONITORED:UNREACHABLE(RUNNING)

This leaves the database processes that were connected to that Admin in the domain state. They can be manually restarted by deleting the Pods, which will cause the SM and TE to be replaced by ones connected to the remaining Admin:

kubectl delete pod sm-db-nuodb-cluster0-demo-database-0 te-db-nuodb-cluster0-demo-database-6d9c946569-ghvgf

Use Case 2: SM Scale-down

SM scale-down is similar to Admin scale-down, except that the Admin has to map the unscheduled SM Pods to the archive IDs to be removed. The Admin performs the following actions when it detects an SM scale-down event:

  1. Find all of the non-running archive IDs for the database whose StatefulSet was scaled down.

  2. Find the most recent tombstone for each non-running archive ID.

  3. For each archive ID, if the pod-name NuoDB process label in its most recent tombstone has an ordinal greater than or equal to the current replica count, remove (but do not purge) the archive.

The archive is removed but not purged so that if the database StatefulSet is scaled back up, the archive can be resurrected.
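
The decision in steps 1-3 can be sketched as follows, assuming the non-running archive IDs and the pod-name labels from their most recent tombstones have already been collected. This is a hypothetical helper, not the actual Admin implementation:

```python
def archives_to_remove(tombstone_pod_names, replicas):
    """Return the archive IDs to remove (but not purge) after scale-down.

    tombstone_pod_names maps each non-running archive ID to the pod-name
    label from its most recent tombstone; the label is assumed to end in
    the StatefulSet ordinal (e.g. sm-db-nuodb-cluster0-demo-database-1).
    """
    removed = []
    for archive_id, pod_name in tombstone_pod_names.items():
        ordinal = int(pod_name.rsplit("-", 1)[1])
        if ordinal >= replicas:
            removed.append(archive_id)
    return sorted(removed)

# After scaling the SM StatefulSet down to 1 replica, only the archive
# last bound to ordinal 1 is removed.
print(archives_to_remove(
    {0: "sm-db-nuodb-cluster0-demo-database-0",
     1: "sm-db-nuodb-cluster0-demo-database-1"}, 1))  # [1]
```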

Consider the deployment environment above, which has replicas: 2 for the SM StatefulSet sm-db-nuodb-cluster0-demo-database. If we invoke kubectl scale --replicas=1 on sm-db-nuodb-cluster0-demo-database, then the SM Pod sm-db-nuodb-cluster0-demo-database-1 will no longer be scheduled by Kubernetes. Since storage-group leader assignment requires collecting archive histories from all archive objects for the database, the Admin must remove the archive object so that the absence of an SM running on that archive does not block database restart.

$ kubectl exec dom-nuodb-cluster0-admin-0 -- nuocmd show domain
server version: 4.1.vee-2-644d1d6206, server license: Enterprise
server time: 2020-08-05T15:36:15.209, client token: 3b7260437f92d4867df91ff75abf60e9f3bddd81
Servers:
  [dom-nuodb-cluster0-admin-0] dom-nuodb-cluster0-admin-0.nuodb.default.svc.cluster.local:48005 [last_ack = 0.37] [member = ADDED] [raft_state = ACTIVE] (LEADER, Leader=dom-nuodb-cluster0-admin-0, log=0/207/207) Connected *
Databases:
  demo [state = RUNNING]
    [TE] te-db-nuodb-cluster0-demo-database-6d9c946569-tv2rb/172.17.0.11:48006 [start_id = 12] [server_id = dom-nuodb-cluster0-admin-0] [pid = 39] [node_id = 3] [last_ack =  4.52] MONITORED:RUNNING
    [SM] sm-db-nuodb-cluster0-demo-database-1/172.17.0.9:48006 [start_id = 15] [server_id = dom-nuodb-cluster0-admin-0] [pid = 59] [node_id = 4] [last_ack =  4.52] MONITORED:RUNNING
    [TE] te-db-nuodb-cluster0-demo-database-6d9c946569-4js2b/172.17.0.5:48006 [start_id = 17] [server_id = dom-nuodb-cluster0-admin-0] [pid = 41] [node_id = 6] [last_ack =  8.51] MONITORED:RUNNING
    [SM] sm-db-nuodb-cluster0-demo-database-0/172.17.0.7:48006 [start_id = 18] [server_id = dom-nuodb-cluster0-admin-0] [pid = 59] [node_id = 7] [last_ack =  4.42] MONITORED:RUNNING

$ kubectl scale --replicas=1 statefulset sm-db-nuodb-cluster0-demo-database
statefulset.apps/sm-db-nuodb-cluster0-demo-database scaled

$ kubectl exec dom-nuodb-cluster0-admin-0 -- nuocmd show domain
server version: 4.1.vee-2-644d1d6206, server license: Enterprise
server time: 2020-08-05T15:36:39.996, client token: 6fa1dd68425ce936a956f797279f3023034cb112
Servers:
  [dom-nuodb-cluster0-admin-0] dom-nuodb-cluster0-admin-0.nuodb.default.svc.cluster.local:48005 [last_ack = 1.15] [member = ADDED] [raft_state = ACTIVE] (LEADER, Leader=dom-nuodb-cluster0-admin-0, log=0/212/212) Connected *
Databases:
  demo [state = RUNNING]
    [TE] te-db-nuodb-cluster0-demo-database-6d9c946569-tv2rb/172.17.0.11:48006 [start_id = 12] [server_id = dom-nuodb-cluster0-admin-0] [pid = 39] [node_id = 3] [last_ack =  9.32] MONITORED:RUNNING
    [TE] te-db-nuodb-cluster0-demo-database-6d9c946569-4js2b/172.17.0.5:48006 [start_id = 17] [server_id = dom-nuodb-cluster0-admin-0] [pid = 41] [node_id = 6] [last_ack =  3.98] MONITORED:RUNNING
    [SM] sm-db-nuodb-cluster0-demo-database-0/172.17.0.7:48006 [start_id = 18] [server_id = dom-nuodb-cluster0-admin-0] [pid = 59] [node_id = 7] [last_ack =  9.22] MONITORED:RUNNING

$ kubectl exec dom-nuodb-cluster0-admin-0 -- nuocmd show archives
[0] <NO VALUE> : /var/opt/nuodb/archive/nuodb/demo @ demo [journal_path = ] [snapshot_archive_path = ] RUNNING
  [SM] sm-db-nuodb-cluster0-demo-database-0/172.17.0.7:48006 [start_id = 18] [server_id = dom-nuodb-cluster0-admin-0] [pid = 59] [node_id = 7] [last_ack =  8.10] MONITORED:RUNNING

$ kubectl exec dom-nuodb-cluster0-admin-0 -- nuocmd show archives --removed
[1] <NO VALUE> : /var/opt/nuodb/archive/nuodb/demo @ demo [journal_path = ] [snapshot_archive_path = ] REMOVED(NOT_RUNNING)
  [SM] sm-db-nuodb-cluster0-demo-database-1/172.17.0.9:48006 [start_id = 15] [server_id = dom-nuodb-cluster0-admin-0] [pid = 59] [node_id = 4] EXITED(REQUESTED_SHUTDOWN:SHUTTING_DOWN):(2020-08-05T15:36:32.197+0000) Gracefully shutdown engine (?)

Note that the database object is in RUNNING state even though there is only one SM process (nuocmd show domain) and that there is one active archive object for the database (nuocmd show archives). The archive object still exists in the domain state as a removed archive (nuocmd show archives --removed), so that if the SM StatefulSet is scaled back up, the archive will be bound to and restarted by the next instance of sm-db-nuodb-cluster0-demo-database-1 that is scheduled by Kubernetes.

Use Case 3: PVC Deletion

If a PVC is explicitly deleted for a Pod controlled by a StatefulSet, then Kubernetes will provision a new PVC the next time that Pod is scheduled. In this case, the Admin will automatically remove and purge the associated archive ID, since an SM will never be started on the original archive again.
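
The purge decision can be sketched as follows; this is a hypothetical helper, not the actual Admin implementation:

```python
def archives_to_purge(archive_pvc_names, existing_pvcs):
    """Return the archive IDs whose backing PVC no longer exists, so the
    archive object can be removed and purged from the domain state.

    archive_pvc_names maps archive ID -> PVC name; existing_pvcs is the
    set of PVC names currently reported by Kubernetes.
    """
    return sorted(
        archive_id
        for archive_id, pvc_name in archive_pvc_names.items()
        if pvc_name not in existing_pvcs
    )

# Archive 1 is purged once its PVC has been deleted; archive 0 survives.
print(archives_to_purge(
    {0: "archive-volume-sm-db-nuodb-cluster0-demo-database-0",
     1: "archive-volume-sm-db-nuodb-cluster0-demo-database-1"},
    {"archive-volume-sm-db-nuodb-cluster0-demo-database-0"}))  # [1]
```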

Continuing the example above, archive ID 1, which is associated with PVC archive-volume-sm-db-nuodb-cluster0-demo-database-1, can be purged from the domain state by deleting PVC archive-volume-sm-db-nuodb-cluster0-demo-database-1, which still exists despite the fact that the SM StatefulSet has been scaled down:

$ kubectl delete pvc archive-volume-sm-db-nuodb-cluster0-demo-database-1
persistentvolumeclaim "archive-volume-sm-db-nuodb-cluster0-demo-database-1" deleted

$ kubectl exec dom-nuodb-cluster0-admin-0 -- nuocmd show archives --removed
$

Use Case 4: Pod Deletion or Container Exit

If a Pod is deleted or a database container exits, the Admin should remove any process object generated by it. Normally, the Admin process connected to a database process detects that it has exited, either because a TCP_RST is generated by the socket connection with the database process, or because the timeout specified by the processLivenessCheckSec property in nuoadmin.conf has elapsed since it last received a message from the database process. If the connected Admin is not running, as was the case in the Use Case 1: Admin Scale-down section when the Pods for the orphaned database processes were deleted, then the command to remove the process object from the domain state is executed as a result of the deletion or state-change event on the Pod.
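
Identifying the orphaned process objects can be sketched as follows; this is a hypothetical helper, not the actual Admin implementation, and the pod names are made up for illustration:

```python
def orphaned_start_ids(process_pods, live_pods):
    """Return the start IDs of process objects whose Pods no longer exist.

    process_pods maps start ID -> the pod-name NuoDB process label;
    live_pods is the set of Pod names currently reported by Kubernetes.
    """
    return sorted(
        start_id
        for start_id, pod_name in process_pods.items()
        if pod_name not in live_pods
    )

# Start IDs 13 and 16 become removable after their Pods are deleted.
print(orphaned_start_ids(
    {12: "te-pod-a", 13: "te-pod-b", 16: "sm-pod-b"},
    {"te-pod-a"}))  # [13, 16]
```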

Scheduling Resync Actions

The actions for use cases 2, 3, and 4 should only be executed by a specific Admin process to avoid executing redundant Raft commands. NuoDB achieves this by using Kubernetes Lease objects to designate a Resync Leader, which is the only Admin process in the Kubernetes cluster that can perform resync actions until the lease expires.
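
The takeover decision can be sketched as follows. This is a hypothetical sketch of the coordination logic only, not the actual Admin implementation, which updates a real Kubernetes Lease object:

```python
import time

def may_acquire_resync_lease(holder, my_server_id, renew_time,
                             lease_duration, now=None):
    """Decide whether this Admin may take over (or renew) the
    resync-leader Lease.

    holder is the current holder's server ID (None if unclaimed),
    renew_time is the last renewal timestamp in seconds, and
    lease_duration is how long a renewal remains valid.
    """
    if now is None:
        now = time.time()
    if holder is None or holder == my_server_id:
        return True  # unclaimed, or we already hold it
    return now - renew_time > lease_duration  # holder's lease expired

# admin-0 may take over only once admin-1's lease has expired.
print(may_acquire_resync_lease("admin-1", "admin-0", 100.0, 30.0, now=140.0))  # True
print(may_acquire_resync_lease("admin-1", "admin-0", 100.0, 30.0, now=120.0))  # False
```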

Multi-cluster Support

In a multi-cluster Kubernetes deployment, NuoDB processes are scheduled across separate Kubernetes clusters. Since different Kubernetes clusters generate disjoint events, resync actions are performed by an Admin process in each cluster. The use of Kubernetes Lease objects allows an Admin in each Kubernetes cluster to act as Resync Leader, allowing NuoDB state to converge with multi-cluster Kubernetes state for use cases 2, 3, and 4. Use case 1, which requires all running Admins to determine the complete set of peers that are not being scheduled due to StatefulSet scale down, is not supported in multi-cluster Kubernetes deployments.