Keeping Healthy Systems with vCenter Operations Manager

An Overview of Key Metrics

Virtual Systems uses VMware’s vCenter Operations Manager (vCOPs), a top-tier analytics tool for gathering, measuring and analyzing data in order to reduce risk, optimize efficiency and maintain the health of our entire virtual environment. vCOPs collects metrics from various aspects of a VMware environment, including CPU, RAM, networking and disk usage, to create a comprehensive view of the current state of the environment. It tracks anomalous data, problem areas, current health and faults to present the status of the environment clearly and support a proactive approach to environment management.

How Virtual Systems Sees Its Entire Infrastructure

While vCenter provides a simple way to get a snapshot of the data, it does little to provide insight into the overall health of the environment. vCOPs collects data to provide long-term analytic information on the status of various systems running in a VMware environment. The three primary metrics vCenter Operations Manager uses to denote overall system operability are health, risk, and efficiency.

Managing System Health

vCOPs measures these metrics on many different levels. It categorizes health from the top down, looking at an entire “world” (which consists of everything running in an environment): vCenter servers, datacenters and so on all the way down to the virtual machine level. Health of any of these objects is determined by key metrics known in vCOPs as workload, anomalies and faults.

Workload measures how hard an object is working. It covers traditional metrics such as CPU usage and demand, memory usage and demand, disk I/O and network I/O. vCOPs determines the usage and demand for each of these resources and indicates which resource is most constrained, which resources are overprovisioned and even which resources are acting outside their normal bounds. Workload monitoring alone does not set vCOPs apart, but these measurements are key to understanding the basic functioning of any system.
Anomalies takes aggregated workload information over the period of time that the object has been sending data to vCOPs and delivers information on what it perceives to be anomalous activity. This information can be used to predict the needs of various objects from a hardware management standpoint, and also predict problems down the line for disk or network usage.
Faults are pretty straightforward. When an object registers a fault, vCOPs records the information and notes the issue, which ensures that something like an unplanned reboot of an environment object does not go unnoticed and that the problem can be remediated.
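To make the idea behind anomalies concrete, here is a minimal Python sketch. It is not how vCOPs computes anomalies (this article does not describe that); it simply assumes that "normal" is the historical mean plus or minus a few standard deviations. The function names, the threshold k and the sample values are all hypothetical.

```python
from statistics import mean, stdev

def normal_bounds(history, k=3.0):
    """Return a (low, high) band: mean +/- k standard deviations of history."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma

def find_anomalies(history, recent, k=3.0):
    """Return the recent samples that fall outside the band learned from history."""
    low, high = normal_bounds(history, k)
    return [x for x in recent if x < low or x > high]

# Hypothetical CPU usage samples (percent) for a single virtual machine.
cpu_history = [22, 25, 24, 23, 26, 25, 24, 22, 23, 26]
cpu_recent = [24, 25, 91, 23]

print(find_anomalies(cpu_history, cpu_recent))  # prints [91]
```

The longer an object has been reporting data, the better the band reflects its real behavior, which is why aggregated history matters for this kind of check.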

Managing System Risk

vCenter Operations Manager is meant to be a tool that changes the work from a reactive approach of “fix something when it breaks” to a proactive approach that allows an administrator to take care of a potential issue before it actually becomes an issue. This is where the second key metric, risk, comes into play. Risk comprises time remaining, capacity remaining, and stress.

Time remaining is a metric calculated using statistical methods by following trends in the environment relating to standard health metrics (e.g. disk space, memory, disk I/O and CPU) and how they change in use and distribution throughout the lifetime of the environment. It predicts when each of the metrics it tracks is liable to become a problem. This allows IT to proactively prevent the issue by acquiring more RAM, CPU, or hard disk space, for example.
Capacity remaining is similar to time remaining, but instead of measuring time, it measures environment capacity in terms of virtual machines. vCOPs uses a mathematical model of the average amount of space left to determine the number of additional virtual machines that can be created in the environment without suffering performance loss.
Stress is the last of the key risk factors that vCOPs takes into consideration when determining the amount of risk to the current environment. Stress is a lot like health, but it provides that sort of information over a long timeline in order to determine the hotspots that need attention. Stress helps to show which machines have been under a heavy load and what can be done to remediate it.
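The two forecasting metrics above can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: time remaining is modeled as a straight-line (least-squares) trend, and capacity remaining as the number of average-sized VMs that fit in the scarcest resource. The statistical model vCOPs actually uses is not described here, and every name and number below is hypothetical.

```python
def time_remaining(samples, capacity):
    """Estimate how many intervals remain until usage reaches capacity.

    samples: usage measurements taken at equal intervals (e.g. one per day).
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(samples) / n
    # Least-squares slope: growth in usage per interval.
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples))
             / sum((x - x_mean) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    current = intercept + slope * (n - 1)  # fitted usage at the latest sample
    return max(0.0, (capacity - current) / slope)

def capacity_remaining(free, avg_vm_demand):
    """Number of additional average-sized VMs that fit in the scarcest resource."""
    return min(int(free[r] // avg_vm_demand[r]) for r in avg_vm_demand)

# Hypothetical datastore usage in GB, sampled once per day.
disk_used_gb = [500, 510, 520, 530, 540]
print(time_remaining(disk_used_gb, capacity=1000))  # prints 46.0 (days)

# Hypothetical free resources and average per-VM demand.
free = {"cpu_ghz": 40.0, "memory_gb": 256.0, "disk_gb": 4000.0}
avg_vm = {"cpu_ghz": 1.5, "memory_gb": 8.0, "disk_gb": 100.0}
print(capacity_remaining(free, avg_vm))  # prints 26 (CPU is the constraint)
```

Taking the minimum across resources reflects the point that the most constrained resource, not the average, decides how many more virtual machines the environment can absorb.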

Managing System Efficiency

The final metric of vCenter Operations Manager’s primary trio is efficiency. Efficiency helps to locate areas in the environment in which there are opportunities for optimization. Efficiency is determined by two metrics that vCOPs calls reclaimable waste and density.

Reclaimable waste looks at the provisioning of various machines to determine the amount of wasted or overprovisioned vCPUs, disk space, and vMemory in the environment that can be reclaimed from those machines and used elsewhere. vCOPs has reporting features that enable IT to get a granular look at the environment in these regards and to see exactly where the overprovisioning is and how to remediate the issues within the environment. This makes sure that important processing power and RAM are used as efficiently as possible.
Density tells how consolidated an environment is by determining VM:Host ratio, vCPU:CPU ratio, and vMEM:Physical Memory ratio. vCOPs lists ideal and optimal ratios for each of these categories. Comparing the actual ratios against them helps an environment achieve optimal performance by flagging cases where, for example, too many VMs sit on a single host or the ratio of vMem to physical memory is low.
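Both efficiency metrics come down to simple arithmetic, sketched below in Python. This is an illustration only, not the vCOPs calculation: it assumes that an allocation is "waste" when it exceeds observed peak demand plus 25% headroom, and the VM class, the field names and the inventory figures are hypothetical.

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class VM:
    name: str
    vcpus: int               # vCPUs allocated
    vmem_gb: float           # memory allocated
    peak_vcpus_used: float   # observed peak CPU demand, in vCPUs
    peak_mem_gb_used: float  # observed peak memory demand

def density(vms, hosts, physical_cores, physical_mem_gb):
    """Consolidation ratios: VM:Host, vCPU:CPU and vMEM:Physical Memory."""
    return {
        "vm_per_host": len(vms) / hosts,
        "vcpu_per_core": sum(v.vcpus for v in vms) / physical_cores,
        "vmem_per_physical_mem": sum(v.vmem_gb for v in vms) / physical_mem_gb,
    }

def reclaimable(vms, headroom=1.25):
    """Total allocation beyond peak demand plus headroom (an assumed rule)."""
    vcpus = sum(max(0, v.vcpus - ceil(v.peak_vcpus_used * headroom)) for v in vms)
    mem = sum(max(0.0, v.vmem_gb - v.peak_mem_gb_used * headroom) for v in vms)
    return {"vcpus": vcpus, "vmem_gb": mem}

# Hypothetical inventory: two hosts with 8 physical cores and 64 GB RAM in total.
vms = [
    VM("web01", 4, 16.0, 1.2, 6.0),
    VM("db01", 8, 32.0, 6.5, 28.0),
    VM("test01", 2, 8.0, 0.4, 2.0),
]

print(density(vms, hosts=2, physical_cores=8, physical_mem_gb=64.0))
print(reclaimable(vms))  # prints {'vcpus': 3, 'vmem_gb': 14.0}
```

In this made-up inventory, db01 is sized close to its demand and contributes no waste, while web01 and test01 account for everything that could be handed back.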

Proactive Problem-Solving with vCOPs Metrics

Using key metrics from vCenter Operations Manager allows Virtual Systems engineers to understand the overall status of our environment. By leveraging this powerful tool, Virtual Systems is able to solve, and even prevent, many problems before they become issues that impact users.
