How to monitor VMware Horizon for improving performance?
VMware Horizon deployments are often plagued with performance problems. Troubleshooting these problems is not a trivial task because there are many reasons for the poor performance of VMware Horizon deployments: storage issues, network bottlenecks, insufficient resources (CPU, RAM, etc.) and software issues. Effective monitoring is the first step in identifying the root cause of poor performance. Once the root cause is discovered, it can then be effectively tackled. For example, VirtuCache is a very effective solution for addressing storage issues in VMware Horizon deployments.
Monitoring guidelines are presented in this article to help administrators identify the performance issues of VMware Horizon deployments.
What to monitor?
Performance Management Metrics
Consumption or utilization metrics, such as IOPS, throughput, usage, etc., are primarily used to manage capacity. These are used to project peak utilization in capacity planning. Resources (CPU, network, disk, etc.) are allocated based on this capacity plan. So, these utilization metrics do need to be measured continuously to ensure that planning projections have not been invalidated and workloads (applications, virtual machines) have adequate resources.
However, high utilization metrics do not always provide clues to the reasons for poor performance. As long as there is no contention between workloads, high utilization does not compromise performance. For example, one VM saturating the disk bandwidth on an ESXi host should not be a problem if the other VMs on that host are CPU-bound and not doing much disk I/O. In fact, 100% utilization of provisioned resources is the best case for return on investment. Administrators are used to monitoring these utilization metrics in physical server-based deployments. In virtualized environments, however, workloads contend with each other for system resources. As a result, the primary metric for performance in VDI deployments is contention. It can manifest in different forms, such as network or disk latency, context switching, packet drops, etc.
Performance problems can manifest even with low utilization metrics. This is because performance is specific to an individual consumer (VM), whereas the utilization metrics typically track the provider (ESXi) level usage. For example, a VM may experience high disk latency even though the throughput or IOPS is low. Contention metrics, thus, provide better clues for troubleshooting performance issues.
So, administrators need to be aware of the significance of these key contention metrics such as CPU Ready, RAM Contention, etc. CPU Ready is an often misunderstood indicator: it does not signify how much the CPU is ready for use. Instead, it indicates the amount of time the virtual machines (and applications) are waiting to be scheduled on the CPU. Thus, the CPU Ready percentage should be low and higher values indicate CPU contention.
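vSphere performance charts commonly report CPU Ready as a summation in milliseconds per sampling interval rather than as a percentage. The following is a minimal sketch of the conversion, assuming the 20-second interval used by the real-time charts. The function name and the per-vCPU normalisation are illustrative choices, not part of any VMware API.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (milliseconds) into a percentage.

    ready_ms   -- time the VM spent waiting to be scheduled during the interval
    interval_s -- sampling interval in seconds (assumed 20 s for real-time charts)
    vcpus      -- number of vCPUs; dividing by it gives a per-vCPU average
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


# A single-vCPU VM that waited 1000 ms in a 20 s interval
print(cpu_ready_percent(1000))          # 5.0
# A 4-vCPU VM with 400 ms of summed ready time in the same interval
print(cpu_ready_percent(400, vcpus=4))  # 0.5
```

Longer roll-up intervals (for example, past-day or past-week charts) need the matching interval length passed as `interval_s`.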
Proactive monitoring of performance also requires insights from different angles. If a user complains of performance problems, the questions to ask are:
- How bad is the problem?
- How many users are affected?
- How long did it last? Is there a pattern?
In other words, the depth and breadth of the problem need to be assessed. In case of an isolated incident, the affected objects need to be closely inspected (e.g. does the affected VM have enough RAM provisioned?). On the other hand, the shared aspects warrant a closer look in case of widespread problems (e.g. do all these affected VMs share a datastore which experienced high latencies?). This is fairly obvious but needs to be stressed.
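One way to check whether a widespread problem points at a shared component is to count the affected VMs per datastore (or host, or cluster). The sketch below assumes you already have a VM-to-datastore mapping exported from your inventory. All VM and datastore names are hypothetical.

```python
from collections import Counter


def shared_component_counts(affected_vms, placement):
    """Count how many affected VMs sit on each shared component.

    affected_vms -- iterable of VM names reported as slow
    placement    -- dict mapping VM name -> shared component (e.g. datastore)
    """
    return Counter(placement[vm] for vm in affected_vms if vm in placement)


# Hypothetical inventory: which datastore backs each desktop VM
placement = {"vdi-01": "ds-a", "vdi-02": "ds-a", "vdi-03": "ds-b", "vdi-04": "ds-a"}
affected = ["vdi-01", "vdi-02", "vdi-04"]

counts = shared_component_counts(affected, placement)
print(counts.most_common(1))  # [('ds-a', 3)]
```

If most of the affected VMs land on one component, that component is the first place to look. An even spread suggests the cause lies elsewhere.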
Clear thresholds for these contention metrics need to be applied to assess the severity and breadth of the problems. For example, while monitoring VMs experiencing CPU contention, if the alert threshold is too strict, you will get a lot of early warnings which may not pan out; if it is too relaxed, you may not catch problems before they become severe. Here is a table for reference, which can be tailored to your deployment:
| Dimension | Metric | Band 1 (lowest severity) | Band 2 | Band 3 | Band 4 (highest severity) |
|---|---|---|---|---|---|
| How Broad? | % of VMs with CPU Ready > 1% | 0-2.5% | 2.5-5% | 5-10% | > 10% |
| | % of VMs with RAM Contention > 1% | 0-2.5% | 2.5-5% | 5-10% | > 10% |
| | % of VMs with Disk Latency > 10ms | 0-2.5% | 2.5-5% | 5-10% | > 10% |
| How Deep? | Max VM CPU Ready | 0-2.5% | 2.5-5% | 5-7.5% | > 7.5% |
| | Max VM RAM Contention | 0-1% | 1-3% | 3-5% | > 5% |
| | Max VM Disk Latency | 0-10ms | 10-20ms | 20-30ms | > 30ms |
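The table can be applied mechanically once per-VM contention values have been collected. The sketch below is only an illustration: the band numbers (1 for the leftmost range column, 4 for the rightmost), the metric keys, and the choice to treat each upper bound as inclusive are assumptions, and the sample values are made up.

```python
from bisect import bisect_left

# Upper bounds of the first three bands, copied from the reference table
BREADTH_BOUNDS = [2.5, 5.0, 10.0]  # % of VMs over the per-VM limit
DEPTH_BOUNDS = {
    "cpu_ready": [2.5, 5.0, 7.5],            # max VM CPU Ready (%)
    "ram_contention": [1.0, 3.0, 5.0],       # max VM RAM Contention (%)
    "disk_latency_ms": [10.0, 20.0, 30.0],   # max VM disk latency (ms)
}
# Per-VM limit used for the breadth rows of the table
PER_VM_LIMIT = {"cpu_ready": 1.0, "ram_contention": 1.0, "disk_latency_ms": 10.0}


def band(value, upper_bounds):
    """Return 1..4; a value equal to an upper bound stays in the lower band."""
    return bisect_left(upper_bounds, value) + 1


def assess(metric, per_vm_values):
    """Compute breadth (% of VMs over the limit) and depth (worst VM) with bands."""
    if not per_vm_values:
        raise ValueError("per_vm_values must not be empty")
    over = sum(1 for v in per_vm_values if v > PER_VM_LIMIT[metric])
    breadth_pct = 100.0 * over / len(per_vm_values)
    depth = max(per_vm_values)
    return {
        "breadth_pct": breadth_pct,
        "breadth_band": band(breadth_pct, BREADTH_BOUNDS),
        "depth": depth,
        "depth_band": band(depth, DEPTH_BOUNDS[metric]),
    }


# 20 VMs: 19 with 0.5% CPU Ready and one with 6%
result = assess("cpu_ready", [0.5] * 19 + [6.0])
print(result)
```

In this sample only one VM in twenty is affected, so the breadth is modest, but that VM is deep into contention. This is the isolated-incident pattern described earlier.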
How to monitor?
How to monitor performance using VMware Horizon Performance Tracker
Traditional monitoring solutions for VMware infrastructure typically present information with a lag. This may be acceptable for server infrastructure but is inadequate for monitoring the user experience (UX) of end-users. Real-time monitoring is essential for VDI. One way for administrators to stay on top of the end-user experience is by using the VMware Horizon Performance Tracker utility.
The Horizon Performance Tracker is a small utility which runs on a virtual desktop. It monitors the performance of the display protocol and the usage of virtual desktop resources by gathering information from both the Horizon Agent and the Horizon Client. It is installed as part of the agent installation process for the master image. Please note that it is not selected by default.
To run this tool, connect to the virtual desktop and use the shortcut on the desktop to launch it. The first tab, At a Glance, shows the key metrics: CPU usage, network bandwidth, and frames per second. Clicking on the table icon brings up information about the session, such as the display protocol (Blast or PCoIP) and the video encoder. The second tab, Session Properties, shows detailed information about the session, such as the client details (IP, MAC, FQDN, keyboard language and repeat rate, etc.) as well as the broker information (username, DNS, IP, etc.).
How to monitor performance using vRealize Operations Management Pack
The VMware vRealize Operations Management Pack for Horizon is an excellent resource for monitoring as it provides handy dashboards aiding proactive management of performance bottlenecks. Various customizable alerts and reports are available with these dashboards to aid in operational awareness. Unlike the built-in Horizon Performance Tracker, it needs to be purchased and installed separately.
These dashboards display the performance of the sessions at the consumer layer and the aggregate performance of the workloads at the infrastructure layer. The network, datacenter and storage dashboards are separated out since these are typically owned and managed by separate teams.
The dashboards form a flow, passing context as you drill down. You can drill down from the bird's-eye view to the underlying VM supporting a session. At the top level you can view the dashboard that covers all the Pods in the Horizon deployment. You can then drill down to either the RDS Farm or the VDI Pool. Within each branch, you can drill down to the individual session.
The dashboards share a common design with Summary and Detail sections. The Summary section is generally placed at the top of the dashboard and gives the larger picture. The Detail section lets you drill down into a specific object. For example, for VM performance, you can get the performance details of a specific VM. The Detail section is also designed for quick context switching, as you often need to check the performance of multiple objects during troubleshooting. For example, the RDS Host Performance dashboard gives you all the RDS Host-specific information and allows you to see the metrics without changing screens. You can move from one RDS Host to another and view the details without opening multiple windows.
The vRealize Operations Management Pack for Horizon offers the following performance dashboards:
- Horizon User Performance
This dashboard provides a distribution breakdown of the performance of all the Horizon users by the user performance metric and by Key Performance Indicator (KPI) category: CPU, Disk, Memory, and Protocol. This enables administrators to quickly view the related VDI sessions that are impacted by performance problems. It also provides alerts and KPI breakdowns of the performance metrics impacting the user and the Horizon sessions.
The Horizon World and Horizon User Scoreboard widgets provide a quick glance into the overall performance issues impacting the users from a datacenter and network protocol perspective. These widgets can be used to show current overall performance, as well as how the performance is trending over time.
Horizon users with degraded performance can easily be identified in the User Performance KPI and Worst Performing KPI widgets using the single User Performance KPI metric (Worst KPI (%)).
Clicking a distribution from the doughnut chart allows the administrator to see the users within the performance KPI bucket. Selecting a user within the widget updates the User Object Relationship and displays the Horizon sessions for that user. The critical performance KPIs and alerts on the user object are displayed in the corresponding widgets.
If the user has one or more active sessions, you can select the individual sessions in the Object Relationship widget to see the specific KPIs and alerts impacting the user’s individual sessions. You can also navigate the Object Relationship widget for other Horizon objects (Pools, Farms, Pods, and so on) to see the relevant KPIs for those objects.
Inactive or disconnected sessions are displayed with a grey box, as are sessions that do not have an associated VM or RDS Host.
Administrators can also visualize the global user performance per KPI Distribution (Protocol or DC) and select the underperforming users from the additional widgets to begin their root cause analysis.
The user's KPI score is an aggregate of multiple performance-impacting metrics (CPU, Disk, Memory, and Protocol). The score is directly affected by the number of sessions the user has and by how many of them are experiencing issues.
- Horizon Datacenter Performance
Review these charts:
- Count of Pods in the Red: this should be zero
- Average Performance of all Pods: this should be steady
The Pods and World table lists all the Horizon Pods along with the worst performance in the last week for each pod.
- Horizon Network Performance
The network portion has Consumer (Protocol) and Provider (Network Infrastructure) layers. These two layers have different metrics, and it is important to note that they may not correlate. For example, packet loss at the consumer layer does not always mean that there is packet loss at the provider layer. The desktop agent measures packet loss at the protocol layer, and agents can drop out-of-order packets, which does not imply packet loss at the provider layer; it may instead be due to non-optimal routing or misconfiguration.
Latency within the datacenter should be < 1ms.
- Horizon RDS Farm Performance
An RDS Farm is a collection of RDS Hosts, which are mostly identically configured Windows Servers. This dashboard gives the overall performance of the RDS Farms, with the ability to drill down and troubleshoot the performance of a farm.
Review the Farms Performance Distribution bar chart. Expect all of them to be in the green range. Selecting one of the bars reveals the objects within the bucket. Click the Maximize button in the toolbar of the widget to clearly see the list.
Select one or more entries in the scoreboard. The line chart below the scoreboard plots the selected metrics. Use the metric chart widget to compare metrics to see if there is any correlation. You can also stack them. For example, you can combine Read IOPS and Write IOPS to get the Total IOPS. But, you should not combine Read Latency and Write Latency to get total latency as you must consider the read-to-write ratio.
- Horizon RDS Host Performance
This dashboard is designed to complement the RDS Farm Performance dashboard and has a similar design. It acts as the details dashboard, allowing you to drill down from a farm to one of its host members.
Review the Pods table. Expect all of them to be in the green range. Pay attention to the hosts that are not performing well. Similarly, review the RDS Farms and Hosts table.
NB: You cannot drill down from RDS Host to its session because there is no relationship. It has to be done at the RDS Farm level.
- Horizon VDI Pool Performance
A VDI Pool is a collection of identically configured VDI VMs running Microsoft Windows. This dashboard is designed both as an entry point and as a drill-down from the Datacenter dashboard, providing users the ability to drill down into one of the sessions in the pool.
- Horizon VDI Session Performance
A VDI Session maps to a VM. A user can have multiple sessions at the same time; each has its own VM. This dashboard gives the overall performance of the sessions, with the ability to drill down and troubleshoot the performance of a session.
A large environment can have tens of thousands of sessions. To see live performance, use the Live! Horizon Session Performance dashboard.
For each pod, the worst performance in the last week is shown. As vRealize Operations Cloud collects data every 5 minutes, there are 12 x 24 x 7 = 2016 data points in a week. This column shows the worst of those 2016 data points.
- Horizon VDI vSphere Cluster Performance
Performance problems related to vSphere clusters, such as high contention and low utilization, are shown. CPU and Memory are shown separately. CPU problems tend to be more common than memory problems because memory is typically overcommitted at a lower ratio.
Review the heat map at the top of the dashboard. It shows only the vSphere clusters that are part of Horizon. Ideally, all of them should be in the green range.
Review the vSphere Clusters for Horizon table. It lists all the clusters sorted by the least performing in the last week. You can change this time period. None of them should be red.
- Horizon Storage Performance
This dashboard shows performance problems related to storage, such as high latency, high outstanding I/O, and low utilization. It combines contention and utilization metrics in one dashboard but still visually separates them for ease of use. Local datastores are not covered as they are not generally used in Horizon.
Review the two Datastore Performance bar charts. The breadth bar chart measures the population and the percentage of VMs affected.
Review the Datacenters in the Horizon table. Focus on the datacenter with the worst latency. The column is colour coded. Read and write latencies are shown separately for better insight. The nature of the read and write problems may not be the same, so it is helpful to see the difference.
- Horizon Connection Server Performance
All the connection servers are listed, sorted by the least performing server in the last week. The health chart shows the trend of the server performance over time. The four scoreboards show the different aspects of the server’s performance.
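The caution in the RDS Farm Performance notes about stacking metrics can be made concrete with a little arithmetic. Read and Write IOPS add up directly, but latencies do not. If a single latency figure is wanted, one common approach is an average weighted by the read-to-write ratio. The sketch below uses that approach as an assumption; it is not a description of how the dashboards compute anything, and the function names and sample figures are hypothetical.

```python
def total_iops(read_iops: float, write_iops: float) -> float:
    """IOPS are additive, so reads and writes can simply be summed."""
    return read_iops + write_iops


def blended_latency_ms(read_iops: float, read_latency_ms: float,
                       write_iops: float, write_latency_ms: float) -> float:
    """Average latency per I/O, weighted by how many reads and writes occurred."""
    total = read_iops + write_iops
    if total == 0:
        return 0.0  # no I/O, so there is no latency to report
    return (read_iops * read_latency_ms + write_iops * write_latency_ms) / total


# A read-heavy workload: 900 reads/s at 2 ms and 100 writes/s at 20 ms
print(total_iops(900, 100))                 # 1000
print(blended_latency_ms(900, 2, 100, 20))  # 3.8
```

Adding the two latencies would give 22 ms, which overstates what a typical I/O experiences in this read-heavy example.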
How to monitor performance using the in-built Horizon Console
The in-built Horizon Console also provides a basic dashboard (which is quite limited as compared to the vRealize Operations Management Pack). The System Health pane displays information about system components that have issues. Each system component can be clicked to get a high-level view of the affected components, status, and description of the issues. The top panel displays the summary details for dashboard statistics, including the latest refresh date and the total number of issues against Sessions. Bar charts displaying the number of active, disconnected, or idle sessions of virtual desktops are also available. The Sessions page displays information about the sessions.