Examples of Implementing Back-Off Policy

Create a database with back-off settings such that the required minimum uptime for a database process is 60 seconds, and a process that exits sooner is retried at most once every 20 seconds, up to a maximum of 2 times.

nuodb [domain] > create database
Database Name: foo
DBA User: dba
DBA Password: goalie
Template Name (Single Host, Minimally Redundant, Multi Host, Region distributed): 
   Single Host
Template Variables (optional): 
Database Options (optional): mem 1g backoff.reqMinUptime 60000 backoff.maxRetry
   2 backoff.delay 20000 backoff.stopOnMaxRetry true
Timeout (ms/s/m/h/d/w) (optional): 
Template Variable HOST (default: localhost): 54.200.184.21
Database Options for SMs (optional): 
Tag Constraints for SMs (optional): 
Database Options for TEs (optional): 
Tag Constraints for TEs (optional): 
nuodb [domain] > show domain summary
Hosts:
[broker] * ip-172-31-34-74/54.88.10.23:48004 
   (DEFAULT_REGION)
Database: foo, template [Single Host] MET, processes [1 TE, 1 SM], ACTIVE
[SM] ip-172-31-34-74/54.88.10.23:48007 (DEFAULT_REGION) 
   [ pid = 86206 ] [ nodeId = 1 ] RUNNING
[TE] ip-172-31-34-74/54.88.10.23:48008 (DEFAULT_REGION) 
   [ pid = 86218 ] [ nodeId = 2 ] RUNNING
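
As a quick check on the units, the Database Options string entered above can be split into key/value pairs and the two durations converted from milliseconds to seconds. This is only an illustrative sketch: parse_database_options is a hypothetical helper written for this page, not part of any NuoDB tool, and it assumes the options are whitespace-separated pairs as typed in the prompt.

```python
def parse_database_options(text):
    """Split 'key value key value ...' into a dict of strings.

    Hypothetical helper for illustration; assumes whitespace-separated pairs.
    """
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("options must be key/value pairs")
    return dict(zip(tokens[0::2], tokens[1::2]))


# The option string from the create database prompt above, joined onto one line.
options = parse_database_options(
    "mem 1g backoff.reqMinUptime 60000 backoff.maxRetry 2 "
    "backoff.delay 20000 backoff.stopOnMaxRetry true"
)

# The back-off durations are entered in milliseconds.
req_min_uptime_s = int(options["backoff.reqMinUptime"]) / 1000
delay_s = int(options["backoff.delay"]) / 1000
max_retry = int(options["backoff.maxRetry"])

print(req_min_uptime_s, delay_s, max_retry)
```

The values 60000 and 20000 correspond to the 60-second minimum uptime and the 20-second delay described at the top of this section.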

The following examples show the agent log content in various circumstances. The comment lines do not appear in the actual log file, but are provided here to aid understanding.

Transaction Engine Leaves Once

In this example, a TE exits once with an uptime less than backoff.reqMinUptime. Its retry gets skipped once and then it is successfully retried. Predefined alarm definitions EnforcerNodeBackoff and EnforcerNodeBackoffMaxRetried are set. Assuming the TE has stopped (simulated with kill -9 and the process ID given in show domain summary), the following is logged to the agent log:

# The node (TE) leaves:
2015-01-22T13:38:52.276-0500 INFO EventManager.notifyNodeEvent (serv-socket5-thread-81) 
   Node left (observed): [Node TE db=[foo] pid=86218 id=2 (local)]
2015-01-22T13:38:52.285-0500 INFO ProcessService.nodeLeft (serv-socket5-thread-81) 
   cleaning up pid 86218 with exit code 137

# The enforcer notices it needs to restart this process:
2015-01-22T13:38:55.365-0500 INFO Enforcer.enforceRemaining (asynch-dao4-thread-9) 
   Enforcer starting processes for database [foo]

# Because of the back-off settings, the restart of the process is skipped: its uptime is 22661 ms (~23 seconds),
# which is less than our backoff.reqMinUptime setting of 60 seconds. The log also reports when the enforcer will
# next retry, that this is retry 1 of a maximum of 2, and that an alarm has been triggered.
2015-01-22T13:38:55.367-0500 WARNING Enforcer.isBackoffDelay (asynch-dao4-thread-9) 
   Skip due to backoff ProcessSpec[foo, TE, peer=[Peer null:48004 (local)]] Stats uptime=22661 
   nextEnforcerTime=2015-01-22T13:39:12.276-0500 retryCount=1/2

2015-01-22T13:38:55.369-0500 WARNING EventingContainerImpl.raise (stats-wtch2-thread-1) 
   Alarm id=5c375a63-4e38-4f8c-a569-f18a96ce2afd entity=[foo] def=[Alarm Definition [EnforcerNodeBackoff.default]: 
   type=EnforcerNodeBackoff dimension=Database entity=* (Warning)] ud=[Database [foo]: 
   Skip due to backoff ProcessSpec[foo, TE, peer=[Peer null:48004 (local)]] Stats uptime=22661 
   nextEnforcerTime=2015-01-22T13:39:12.276-0500 retryCount=1/2]

# Nothing happens...
2015-01-22T13:38:57.922-0500 INFO Enforcer.enforceRemaining (descsvc-enfc13-thread-1) 
   Enforcer starting processes for database [foo]

# Nothing happens...
2015-01-22T13:39:08.002-0500 INFO Enforcer.enforceRemaining (descsvc-enfc13-thread-1) 
   Enforcer starting processes for database [foo]

# Here, the time is past nextEnforcerTime given above and so the enforcer starts the process and the database goes
# back to ACTIVE and RUNNING.
2015-01-22T13:39:18.077-0500 INFO Enforcer.enforceRemaining (descsvc-enfc13-thread-1) Enforcer starting processes for database [foo]
2015-01-22T13:39:18.102-0500 INFO EventManager.notifyNodeEvent (serv-socket5-thread-81) Node joined: [Node TE db=[foo] pid=86253 id=-1 (local)]
2015-01-22T13:39:18.119-0500 INFO EventManager.notifyNodeIdSet (serv-socket5-thread-81) Node setId=3 [Node TE db=[foo] pid=86253 id=-1 (local)]
2015-01-22T13:39:18.119-0500 INFO EventManager.notifyNodeStateChange (serv-socket5-thread-81) Node state changed to ACTIVE [Node TE db=[foo] pid=86253 id=3 (local)]
2015-01-22T13:39:18.120-0500 INFO EventManager.notifyNodeStateChange (serv-socket5-thread-81) Node state changed to RUNNING [Node TE db=[foo] pid=86253 id=3 (local)]

During this time, the EnforcerNodeBackoff.default alarm was triggered as follows:

$ nuodbmgr ... --command "monitor alarms"
Jan 22, 2015 1:38:55 PM EnforcerNodeBackoff Alarm [EnforcerNodeBackoff.default]:
  Details: Database [foo]: Skip due to backoff ProcessSpec[foo, TE, peer=[Peer null:48004 (local)]] 
     Stats uptime=22661 nextEnforcerTime=2015-01-22T13:39:12.276-0500 retryCount=1/2
     Alarm Definition [EnforcerNodeBackoff.default]: type=EnforcerNodeBackoff 
     dimension=Database entity=* (Warning)

Here too, uptime is 22661, the nextEnforcerTime is reported and the retry count is 1/2.
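
The timing in this example can be reproduced with a little arithmetic. The sketch below is not NuoDB code; it assumes that nextEnforcerTime is the moment the node left plus backoff.delay, which matches the log above (13:38:52.276 plus 20 seconds is 13:39:12.276), and that no back-off applies when the uptime reaches backoff.reqMinUptime. The function name next_enforcer_time is made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%S.%f%z"
REQ_MIN_UPTIME_MS = 60000  # backoff.reqMinUptime from the example
DELAY_MS = 20000           # backoff.delay from the example


def next_enforcer_time(left_at, uptime_ms):
    """Return the earliest restart time, or None if no back-off applies.

    Assumption: the delay is counted from the moment the node left.
    """
    if uptime_ms >= REQ_MIN_UPTIME_MS:
        return None
    return left_at + timedelta(milliseconds=DELAY_MS)


# "Node left" timestamp and uptime taken from the log above.
left_at = datetime.strptime("2015-01-22T13:38:52.276-0500", FMT)
retry_at = next_enforcer_time(left_at, 22661)
print(retry_at.isoformat(timespec="milliseconds"))

# Classify the three periodic enforcer passes shown in the log.
decisions = {}
for stamp in ("13:38:57.922", "13:39:08.002", "13:39:18.077"):
    run = datetime.strptime(f"2015-01-22T{stamp}-0500", FMT)
    decisions[stamp] = "restart" if run >= retry_at else "skip"
    print(stamp, decisions[stamp])
```

Under these assumptions the first two enforcer passes fall before the computed time and do nothing, and the pass at 13:39:18.077 is the first one allowed to restart the TE, as in the log.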

Transaction Engine Leaves maxRetry Times

In this example, the TE exits twice (backoff.maxRetry), each time with an uptime less than backoff.reqMinUptime. Its retry gets skipped twice and then the database is stopped with regard to enforcement. That is, the database can be active and running while enforcement is disabled. Predefined alarm definitions EnforcerNodeBackoff and EnforcerNodeBackoffMaxRetried are set. Assuming the TE has stopped, which can be simulated with kill -9 and the process ID given in show domain summary, the following is logged to the agent log:

# The node (TE) leaves:
2015-01-22T15:08:13.371-0500 INFO EventManager.notifyNodeEvent (serv-socket5-thread-185) Node left (observed): [Node TE db=[foo] pid=87352 id=6 (local)]
2015-01-22T15:08:13.394-0500 INFO ProcessService.nodeLeft (serv-socket5-thread-185) cleaning up pid 87352 with exit code 137

# The enforcer notices it needs to restart this process:
2015-01-22T15:08:16.468-0500 INFO Enforcer.enforceRemaining (asynch-dao4-thread-15) Enforcer starting processes for database [foo]

# Because of the back-off settings, the restart of this process is skipped: its uptime is 17849 ms (~18 seconds),
# which is less than our backoff.reqMinUptime setting of 60 seconds. The log also reports when the enforcer will
# next retry, that this is retry 1 of a maximum of 2, and that an alarm has been triggered.
2015-01-22T15:08:16.469-0500 WARNING Enforcer.isBackoffDelay (asynch-dao4-thread-15) Skip due to backoff ProcessSpec[foo, TE, peer=[Peer null:48004 (local)]] Stats uptime=17849 nextEnforcerTime=2015-01-22T15:08:33.372-0500 retryCount=1/2
2015-01-22T15:08:16.469-0500 WARNING EventingContainerImpl.raise (stats-wtch2-thread-1) Alarm id=149399c1-792b-4ca8-8419-d5146b3225c0 entity=[foo] def=[Alarm Definition [EnforcerNodeBackoff.default]: type=EnforcerNodeBackoff dimension=Database entity=* (Warning)] ud=[Database [foo]: Skip due to backoff ProcessSpec[foo, TE, peer=[Peer null:48004 (local)]] Stats uptime=17849 nextEnforcerTime=2015-01-22T15:08:33.372-0500 retryCount=1/2]

# Enforcer does nothing...
2015-01-22T15:08:18.769-0500 INFO Enforcer.enforceRemaining (descsvc-enfc13-thread-1) Enforcer starting processes for database [foo]
2015-01-22T15:08:28.844-0500 INFO Enforcer.enforceRemaining (descsvc-enfc13-thread-1) Enforcer starting processes for database [foo]

# Here, the time is past nextEnforcerTime given above and so the enforcer starts the process and the database goes
# back to ACTIVE and RUNNING.
2015-01-22T15:08:38.923-0500 INFO Enforcer.enforceRemaining (descsvc-enfc13-thread-1) Enforcer starting processes for database [foo]
2015-01-22T15:08:38.955-0500 INFO EventManager.notifyNodeEvent (serv-socket5-thread-185) Node joined: [Node TE db=[foo] pid=87366 id=-1 (local)]
2015-01-22T15:08:38.971-0500 INFO EventManager.notifyNodeIdSet (serv-socket5-thread-185) Node setId=7 [Node TE db=[foo] pid=87366 id=-1 (local)]
2015-01-22T15:08:38.971-0500 INFO EventManager.notifyNodeStateChange (serv-socket5-thread-185) Node state changed to ACTIVE [Node TE db=[foo] pid=87366 id=7 (local)]
2015-01-22T15:08:38.972-0500 INFO EventManager.notifyNodeStateChange (serv-socket5-thread-185) Node state changed to RUNNING [Node TE db=[foo] pid=87366 id=7 (local)]

# Second time the node (TE) leaves:
2015-01-22T15:08:52.207-0500 INFO EventManager.notifyNodeEvent (serv-socket5-thread-185) Node left (observed): [Node TE db=[foo] pid=87366 id=7 (local)]
2015-01-22T15:08:52.228-0500 INFO ProcessService.nodeLeft (serv-socket5-thread-185) cleaning up pid 87366 with exit code 137

# The enforcer notices it needs to restart this process:
2015-01-22T15:08:55.302-0500 INFO Enforcer.enforceRemaining (asynch-dao4-thread-17) Enforcer starting processes for database [foo]

# Again, the back-off settings kick in. Here the restarting of the process is skipped again because the uptime is 13252
# (~13 seconds - less than our backoff.reqMinUptime setting of 60 seconds). This is retry number 2 of 2 and it reports
# that the enforcer will never run again.
2015-01-22T15:08:55.303-0500 WARNING Enforcer.isBackoffDelay (asynch-dao4-thread-17) Skip due to backoff ProcessSpec[foo, TE, peer=[Peer null:48004 (local)]] Stats uptime=13252 nextEnforcerTime=(never) retryCount=2/2
2015-01-22T15:08:55.303-0500 WARNING EventingContainerImpl.raise (stats-wtch2-thread-1) Alarm id=01e04ddf-e62f-470b-8a08-a9a52cf07630 entity=[foo] def=[Alarm Definition [EnforcerNodeBackoffMaxRetried.default]: type=EnforcerNodeBackoffMaxRetried dimension=Database entity=* (Warning)] ud=[Database [foo]: Skip due to backoff ProcessSpec[foo, TE, peer=[Peer null:48004 (local)]] Stats uptime=13252 nextEnforcerTime=(never) retryCount=2/2]

The alarms triggered will now include EnforcerNodeBackoffMaxRetried.default.

$ nuodbmgr ... --command "monitor alarms"
Jan 22, 2015 3:08:16 PM EnforcerNodeBackoff Alarm [EnforcerNodeBackoff.default]:
   Details: Database [foo]: Skip due to backoff ProcessSpec[foo, TE, peer=[Peer null:48004 (local)]] 
   Stats uptime=17849 nextEnforcerTime=2015-01-22T15:08:33.372-0500 retryCount=1/2
   Alarm Definition [EnforcerNodeBackoff.default]: type=EnforcerNodeBackoff dimension=Database entity=* (Warning)
Jan 22, 2015 3:08:55 PM EnforcerNodeBackoffMaxRetried Alarm [EnforcerNodeBackoffMaxRetried.default]:
   Details: Database [foo]: Skip due to backoff ProcessSpec[foo, TE, peer=[Peer null:48004 (local)]] 
   Stats uptime=13252 nextEnforcerTime=(never) retryCount=2/2
   Alarm Definition [EnforcerNodeBackoffMaxRetried.default]: type=EnforcerNodeBackoffMaxRetried 
   dimension=Database entity=* (Warning)
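
When scanning an agent log or alarm output for these entries, the three interesting fields (uptime, nextEnforcerTime, and retryCount) can be pulled out with a regular expression. The following is a minimal sketch based only on the message layout shown in the examples on this page; parse_backoff and its field names are hypothetical, and the pattern may need adjusting if the message format differs in other releases.

```python
import re

# Pattern derived from the "Skip due to backoff" messages shown above.
BACKOFF_RE = re.compile(
    r"Stats uptime=(?P<uptime>\d+)\s+"
    r"nextEnforcerTime=(?P<next>\S+)\s+"
    r"retryCount=(?P<count>\d+)/(?P<max>\d+)"
)


def parse_backoff(line):
    """Extract back-off fields from a log line, or return None if absent."""
    m = BACKOFF_RE.search(line)
    if m is None:
        return None
    next_time = m.group("next")
    return {
        "uptime_ms": int(m.group("uptime")),
        # "(never)" means the enforcer will not retry again.
        "next_enforcer_time": None if next_time == "(never)" else next_time,
        "retry": int(m.group("count")),
        "max_retry": int(m.group("max")),
    }


first = parse_backoff(
    "Skip due to backoff ProcessSpec[foo, TE, peer=[Peer null:48004 (local)]] "
    "Stats uptime=17849 nextEnforcerTime=2015-01-22T15:08:33.372-0500 retryCount=1/2"
)
second = parse_backoff(
    "Skip due to backoff ProcessSpec[foo, TE, peer=[Peer null:48004 (local)]] "
    "Stats uptime=13252 nextEnforcerTime=(never) retryCount=2/2"
)
print(first)
print(second)
```

A result whose next_enforcer_time is None and whose retry equals max_retry corresponds to the EnforcerNodeBackoffMaxRetried case shown above.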

At this point the database is UNMET and INACTIVE. We need to explicitly start the database. Once the maximum retry count was reached, all information about the back-off was cleared, so starting the database will start the TE.

nuodb [domain] > show domain summary

Hosts:
[broker] * localhost/10.1.37.60:48004 (DEFAULT_REGION)

Database: foo, template [Single Host] UNMET, processes [0 TE, 1 SM], INACTIVE
[SM] 52.25.32.20/10.1.37.60:48007 (DEFAULT_REGION) [ pid = 86206 ] [ nodeId = 1 ] RUNNING
nuodb [domain] > start database foo
nuodb [domain] > show domain summary

Hosts:
[broker] * localhost/10.1.37.60:48004 (DEFAULT_REGION)
Database: foo, template [Single Host] UNMET, processes [1 TE, 1 SM], ACTIVE
[SM] ip-172-31-34-74/10.1.37.60:48007 (DEFAULT_REGION) [ pid = 86206 ] [ nodeId = 1 ] RUNNING
[TE] ip-172-31-34-74/10.1.37.60:48008 (DEFAULT_REGION) [ pid = 87466 ] [ nodeId = 8 ] RUNNING
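
The sequence in this second example (one delayed retry, then no further retries until the database is started by hand) can be summarized as a small state model. This is a toy sketch of the behavior observed in the logs, not the enforcer's actual implementation. The class BackoffTracker and its methods are invented for illustration, and several points are assumptions: that an exit after at least backoff.reqMinUptime restarts immediately and resets the counter, that enforcement stops once the counter reaches backoff.maxRetry (the example runs with backoff.stopOnMaxRetry true), and that an explicit start clears the counter and resumes enforcement.

```python
class BackoffTracker:
    """Toy model of the retry counting seen in the agent log (hypothetical)."""

    def __init__(self, req_min_uptime_ms, delay_ms, max_retry):
        self.req_min_uptime_ms = req_min_uptime_ms
        self.delay_ms = delay_ms
        self.max_retry = max_retry
        self.retry_count = 0
        self.enforcing = True

    def on_exit(self, uptime_ms):
        """Return the delay in ms before the next restart, or None for never."""
        if uptime_ms >= self.req_min_uptime_ms:
            # Assumption: a long-enough run is restarted at once, counter reset.
            self.retry_count = 0
            return 0
        self.retry_count += 1
        if self.retry_count >= self.max_retry:
            # Matches nextEnforcerTime=(never) retryCount=2/2 in the log.
            self.enforcing = False
            return None
        return self.delay_ms

    def start_database(self):
        """Explicit start: back-off information is cleared."""
        self.retry_count = 0
        self.enforcing = True


tracker = BackoffTracker(req_min_uptime_ms=60000, delay_ms=20000, max_retry=2)
first_delay = tracker.on_exit(17849)   # first exit, uptime ~18 seconds
print(first_delay, tracker.retry_count)
second_delay = tracker.on_exit(13252)  # second exit, uptime ~13 seconds
print(second_delay, tracker.retry_count, tracker.enforcing)
tracker.start_database()               # corresponds to "start database foo"
print(tracker.retry_count, tracker.enforcing)
```

With the uptimes from the log, the model yields a 20000 ms delay at retry 1 of 2, then no further retry at 2 of 2, and a cleared counter after the explicit start.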