Case Studies

This section presents several real-world examples of using NuoDB Insights to monitor overall database health.

Degraded performance

Problem Description

The user reported degraded performance starting at around 01:00 hours.

Figure 1. Summary graph

Analysis

Throttling (magenta) occurred much earlier than 01:00 hours, starting from about 13:00 hours the previous day.

Figure 2. Summary graph from 13:00 to 18:00

Throttling (magenta) continued to make up a larger part of each measurement time slice until around 06:00 hours, when durable commits (yellow) started to take much longer to be acknowledged.

Figure 3. Summary graph from 04:00 to 08:00

Throttling (magenta) took over again from 22:00 hours until the load was reduced at 11:00 hours.

Figure 4. Summary graph from 20:00 to 00:00

Conclusion

Network congestion was identified as the main cause of the long durable commit acknowledgment times, with disk I/O performance accounting for the rest.
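The analysis above reads each measurement time slice by asking which category takes the largest share of it. As a minimal sketch of that reasoning, assuming the per-category times have been exported as plain numbers, the following Python finds the first time slice dominated by a given category. The category names and the data layout are hypothetical and are not a NuoDB Insights export format.

```python
from typing import Dict, List

# One measurement time slice: seconds spent per category.
# The category names used with these helpers are illustrative only.
Slice = Dict[str, float]


def category_shares(time_slice: Slice) -> Dict[str, float]:
    """Return each category's fraction of the total time in one slice."""
    total = sum(time_slice.values())
    if total == 0:
        return {name: 0.0 for name in time_slice}
    return {name: value / total for name, value in time_slice.items()}


def dominant_category(time_slice: Slice) -> str:
    """Return the category with the largest share of a non-empty slice."""
    shares = category_shares(time_slice)
    return max(shares, key=shares.get)


def first_slice_dominated_by(slices: List[Slice], category: str) -> int:
    """Return the index of the first slice dominated by `category`, or -1."""
    for index, time_slice in enumerate(slices):
        if dominant_category(time_slice) == category:
            return index
    return -1
```

Calling `first_slice_dominated_by(slices, "throttle")` on a list of slices ordered by time gives the point at which throttling first became the largest component, which may be well before the time at which the user noticed the degradation.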

One slow disk

Problem Description

The user complained of a slow disk.

Figure 5. Fsync/Directory graph

Analysis

The increasing fsync time for the light blue line indicates that more time is taken to write the same data. The archive queue for the light blue line is also increasing. This indicates that the Storage Manager (SM) and Transaction Engine (TE) are lengthening the queue because data is being written slowly.

Figure 6. Archive Queue graph

The light blue line on the Bytes Written Per Second graph shows a decrease.

Figure 7. Bytes Written Per Second graph

Conclusion

The disk on the host of the TE or SM represented by the light blue plot is underperforming.
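The three signals used in this case study (rising fsync time, a growing archive queue, and falling bytes written per second) can be combined into a simple check. The following Python is a rough sketch under the assumption that each host's samples are available as lists ordered by time. The metric keys `fsync_time`, `archive_queue`, and `bytes_written_per_sec` are hypothetical names chosen for this example, and the trend test is a simple comparison of the two halves of each series rather than anything NuoDB Insights itself performs.

```python
from statistics import mean
from typing import Dict, List, Sequence


def trend(samples: Sequence[float]) -> float:
    """Mean of the second half of the series minus mean of the first half."""
    if len(samples) < 2:
        return 0.0
    middle = len(samples) // 2
    return mean(samples[middle:]) - mean(samples[:middle])


def slow_disk_suspects(metrics: Dict[str, Dict[str, List[float]]]) -> List[str]:
    """Return hosts whose fsync time and archive queue rise while
    bytes written per second falls."""
    suspects = []
    for host, series in metrics.items():
        if (
            trend(series["fsync_time"]) > 0
            and trend(series["archive_queue"]) > 0
            and trend(series["bytes_written_per_sec"]) < 0
        ):
            suspects.append(host)
    return suspects
```

Requiring all three signals at once avoids flagging a host that is simply writing more data, in which case fsync time may rise while write throughput rises as well.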

Underperforming SSD disks

Problem Description

The user complained of underperforming SSD disks.

Analysis

The Fsync graph shows long periods of high fsync usage, which impacted performance. A closer inspection reveals that the periods of low fsync usage occurred after long periods of SM downtime.

Figure 8. Fsync/Directory graph

Figure 9. Fsync/Directory graph

Conclusion

It was observed that the maintenance program for the SSDs had not been activated and that the SSDs were suffering from write amplification. Long periods of downtime allowed the disks’ garbage collection to run and recover performance. An example of a maintenance program is running fstrim from a cron job during periods of low utilisation.
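As an illustration of such a maintenance job, the following Python is a minimal sketch of a script that a cron entry could call. It assumes a Linux host where the util-linux `fstrim` command is installed and the script runs as root. The mount point and script path in the comments are placeholders, not paths taken from a NuoDB installation.

```python
import subprocess
from typing import List


def build_fstrim_command(mount_point: str) -> List[str]:
    """Build the fstrim invocation for one mounted filesystem."""
    # --verbose makes fstrim report how many bytes were trimmed.
    return ["fstrim", "--verbose", mount_point]


def trim(mount_point: str) -> str:
    """Run fstrim on the mount point (requires root) and return its output."""
    result = subprocess.run(
        build_fstrim_command(mount_point),
        check=True,           # raise CalledProcessError if fstrim fails
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


# Hypothetical usage from a script scheduled by cron, for example with the
# crontab line "0 3 * * 0 /usr/bin/python3 /path/to/trim_archive.py",
# which runs weekly on Sunday at 03:00:
#
#     print(trim("/path/to/archive"))
```

The schedule should match a period of low utilisation for the database in question, since trimming adds I/O load of its own while it runs.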