Creating Alarms Using Automation Console

To create an alarm using Automation Console:

  1. On the left, select Dashboard.
  2. In the upper right corner, select Actions | Alarm Definitions.
  3. In the Alarm Definitions widget, select Action | Add.

Follow the instructions below to create either a domain event alarm or a statistical alarm.

Creating a Domain Event Alarm

In the Add Alarm dialog, specify a Name and Description, then select any Type other than Stat Alert (see Descriptions of Domain Event Alarm Types). The following example uses Node Left.

You will also be asked to set Severity Level and Alarm Action. See descriptions below.

The second Add Alarm dialog defines the scope for which the alarm is reported: domain, database, host, or process. For database, host, and process, you are prompted to indicate whether the alarm should trigger for all such items or only for specific ones, for example, all databases or just one or more of them.

A current limitation of the Host scope is that the alarm is reported only while the agent on that host is running; it is not reported after that agent restarts.
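The scope rules above can be pictured as a simple filter over incoming events. The following is only an illustrative sketch, not the console's actual implementation: the AlarmScope class, its field names, and the shape of the event dictionary are all hypothetical, and an empty set of names is assumed to stand for "all such items".

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

# The four scope kinds named in the text.
SCOPE_KINDS = ("domain", "database", "host", "process")


@dataclass(frozen=True)
class AlarmScope:
    """Hypothetical model of an alarm scope (names are illustrative only)."""
    kind: str
    # Assumption: an empty set means "all items of this kind".
    names: FrozenSet[str] = frozenset()

    def __post_init__(self) -> None:
        if self.kind not in SCOPE_KINDS:
            raise ValueError(f"unknown scope kind: {self.kind!r}")

    def matches(self, event: Dict[str, str]) -> bool:
        """Return True if the event falls inside this scope."""
        if self.kind == "domain":
            return True  # domain scope covers every event in the domain
        name = event.get(self.kind)
        if name is None:
            return False  # event carries no attribute of this kind
        return not self.names or name in self.names


# Usage: one alarm for all databases, one for a single named host.
all_dbs = AlarmScope("database")
one_host = AlarmScope("host", frozenset({"host-a"}))
event = {"database": "test", "host": "host-b", "process": "1234"}
print(all_dbs.matches(event))   # matches: every database is in scope
print(one_host.matches(event))  # does not match: host-b is not listed
```

Modelling "all" as an empty set keeps the narrower case (one or more named items) and the broad case in a single field; a real alarm definition may represent this differently.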

Try it out: kill a database process (TE or SM), which triggers the Node Left event, and let the enforcer restart it. You can also add an alarm for the Node Joined event to be notified when the process restarts.

Creating a Statistical Alarm

In the Add Alarm dialog, specify a Name and Description and then select Stat Alert.

You will also be asked to set Severity Level and Alarm Action.

Click Next to go to the second Add Alarm Definition dialog, which is slightly more involved than the one for domain event alarms.

In this dialog, select Domain as the Scope, which means the alarm covers all hosts in the domain. The metric chosen here is OS-cpuTotalTimePercent, and the alarm signals a warning if the average CPU utilization across all hosts in the domain stays above 60% for more than 10 minutes. All metric-based alarms have the notion of a breach duration: they fire only if every sample (taken every 30 seconds) during that period satisfies the condition. If at any point within the 10 minutes the average across all hosts drops below 60%, the alarm watch starts over.
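The breach-duration rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the product's code: the BreachWatch class and its method names are hypothetical, timestamps are plain seconds, and a sample at exactly the threshold is treated as not breaching because the condition is "greater than".

```python
from dataclasses import dataclass
from typing import Dict, Optional

SAMPLE_INTERVAL_S = 30  # sampling period stated in the text


@dataclass
class BreachWatch:
    """Hypothetical breach-duration tracker for a domain-scoped metric alarm."""
    threshold: float                         # e.g. 60.0 (percent)
    duration_s: int                          # e.g. 600 (10 minutes)
    breach_started_at: Optional[int] = None  # time of first breaching sample

    def observe(self, now_s: int, per_host: Dict[str, float]) -> bool:
        """Feed one sample of per-host values; return True if the alarm fires."""
        if not per_host:
            raise ValueError("need at least one host value")
        average = sum(per_host.values()) / len(per_host)
        if average <= self.threshold:
            # Condition is false for this sample: the watch starts over.
            self.breach_started_at = None
            return False
        if self.breach_started_at is None:
            self.breach_started_at = now_s
        # Fire only once the breach has lasted longer than the duration.
        return now_s - self.breach_started_at > self.duration_s


# Usage: average above 60% at every 30-second sample from t=0 to t=630.
watch = BreachWatch(threshold=60.0, duration_s=600)
for t in range(0, 631, SAMPLE_INTERVAL_S):
    if watch.observe(t, {"host-a": 70.0, "host-b": 80.0}):
        print(f"alarm fires at t={t}s")
```

Because any non-breaching sample clears the start time, a single dip in the average restarts the full 10-minute watch, which matches the behavior described above.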

See also: Obtaining Metrics for the Domain, Hosts and Processes.