How does Queue Depth affect latency and IOPS in VMware?
If you have a fast array and storage network but are still experiencing high storage latencies for your high-throughput VMs, it might be because of choked Queue Depths at the VM and at the virtual LUN device in the host.
This problem happens only when a few VMs generate most of the IOPS in the VMware cluster, and the overall workload is high IOPS.
This post talks about how to diagnose this issue using Esxtop and how VirtuCache fixes it.
Queue Depth = Latency (in seconds) X IOPS.
Queue Depths in VMware: VM Queue Depth, Adapter Queue Depth (AQLEN), and Device Queue Depth (DQLEN).
IO Depths in VMware: IOs Actively Being Processed (ACTV) and IOs Queued (QUED).
Latencies in VMware: VM Latency (GAVG), Kernel Latency (KAVG), Queue Latency (QAVG), and Device Latency (DAVG).
Environment 1 – No Queue Depth Issues if High IOPS are Spread Evenly Between VMs, Hosts, and LUNs.
Environment 2 – Queue Depth becomes an Issue when only a Few VMs Generate most of the IOPS (‘Noisy Neighbor’ Problem).
How VirtuCache Fixes the Noisy Neighbor Issue in VMware.
Queue Depth is the number of storage IOs a device (virtual or physical) can process simultaneously, i.e. Queue Depth = Latency (in seconds) X IOPS.
So if you want a device (RAID controller, FC HBA, iSCSI NIC, etc.) to process 50,000 IOPS at 2ms latency, then the Queue Depth of the device should be 50,000 X (2/1000) = 100.
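This relationship (Little's Law applied to storage queues) can be sketched in a couple of lines of Python, using the numbers from the example above:

```python
def required_queue_depth(iops: float, latency_ms: float) -> float:
    """Outstanding IOs = arrival rate (IOPS) x time in system (seconds).
    Latency is converted from milliseconds to seconds."""
    return iops * (latency_ms / 1000.0)

# 50,000 IOPS at 2 ms latency needs a Queue Depth of 100.
print(required_queue_depth(50_000, 2))  # -> 100.0
```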
Queue Depth is assigned to a storage device (virtual or physical) by its developer / manufacturer.
The above Queue Depth equation results in the corollaries listed below:
If the total number of storage IOs outstanding against a device (IO depth) exceeds the device's Queue Depth, the additional IOs are queued, and this results in higher storage latencies. So any component in the IO path that has a smaller Queue Depth than the other components will choke the storage IO flowing through the entire stack once the volume of IO (IOPS) hits that component's Queue Depth limit, causing latencies to increase.
For physical hardware, a higher queue depth means that the device can transmit higher storage throughput (IOPS) and is generally of higher quality. For virtual hardware (which is essentially software e.g., VMware’s iSCSI initiator), a higher Queue Depth doesn’t mean that it is better quality software, because the actual work of transmitting the IO is ultimately done by the underlying hardware.
Queue Depth is an important reason why storage IO (data) flows smoothly through the VMware + SAN array storage stack despite the fact that there are many different developers / manufacturers responsible for the different hardware and software in the IO path.
In the below sections, I will explain the connection between Queue Depth, IO Depth, and VM Latency.
Guest VM Queue Depth (GQLEN): By default a VM has a Queue Depth of 32 per vdisk. VM Queue Depth is not displayed in Esxtop. VM Queue Depth can be increased by using VMware's PVSCSI driver in the VM and by raising the ESXi host parameter Disk.SchedNumReqOutstanding (DSNRO).
Device Queue Depth, displayed in the DQLEN field in Esxtop: As the storage IO flows from the VM down to what vCenter calls the 'device', which is a LUN mapped to that host (I will call this the virtual LUN device), it encounters the virtual LUN device's Queue Depth, called DQLEN. The DQLEN value is assigned to the virtual LUN device by the FC HBA, hardware iSCSI NIC, or software iSCSI initiator. It is 128 for the software iSCSI initiator, 64 for a QLogic FC HBA, 32 for an Emulex FC HBA, etc.
To see the DQLEN value, from the ESXi shell, run esxtop and press u.
Adapter Queue Depth, displayed in the AQLEN field in Esxtop: VMware pushes all the IO from the host through the physical FC HBA or the iSCSI initiator. The AQLEN of an FC HBA is greater than 2000; for VMware's software iSCSI initiator it is 1024.
To see AQLEN value, run esxtop, then press d, then press f, then press D to enable QSTATS display, then hit any other key.
IO Depth is the count of IOs being actively processed or queued by a device.
Count of IOs actively being processed by the virtual LUN device, displayed in Esxtop as ACTV.
Count of IOs that exceed the Queue Depth of the virtual LUN device and hence get queued, displayed in Esxtop as QUED. Ideally, there should be no IOs queued; queued IOs result in high VM latencies.
Esxtop shows ACTV and QUED values only at the virtual LUN device level (run Esxtop and press u) and not at the Adapter level, which is sufficient, since Adapter level queues are rarely choked.
Esxtop displays the below latencies (in milliseconds) at the adapter level (Esxtop > press d) and virtual LUN device level (Esxtop > press u).
Guest VM level latency, displayed in Esxtop as GAVG – This is the storage latency as observed in the VM. Regardless of how high or random the throughput/IOPS, this value should always be below 3ms. It is the sum of KAVG and DAVG latency listed below.
VMware kernel latency, displayed in Esxtop as KAVG. KAVG is the time the storage IO takes to go through the VMware kernel. KAVG also includes Queue latency (QAVG). If the QUED field is nonzero then QAVG will display the time the IO takes to go through the VMware Queue. Ideally there should be no IOs queued (QUED=0) and hence QAVG should always be 0. If IO depth of the storage IO exceeds the Queue Depth of the virtual LUN device, then some IO will be queued and that will lead to nonzero QAVG latencies, which in turn increases VM latencies.
Say the DQLEN is 128 and the IO depth of the storage IO hitting that device is 140; then 12 IOs will be queued (QUED = 12). This causes QAVG to increase, which in turn leads to a high KAVG, which in turn causes GAVG to increase.
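The arithmetic above, plus the latency relationships from the previous paragraphs, can be sketched as follows (the millisecond figures are hypothetical, chosen only to show how GAVG is composed):

```python
DQLEN = 128            # virtual LUN device Queue Depth (software iSCSI)
io_depth = 140         # IOs outstanding against the device

qued = max(0, io_depth - DQLEN)  # IOs that don't fit wait in the kernel queue
print(qued)  # -> 12

# Hypothetical esxtop latencies (ms) showing how GAVG is built up:
qavg = 1.0             # time spent waiting in the queue (nonzero because QUED > 0)
kavg = qavg + 0.5      # kernel latency includes queue latency
davg = 1.5             # device latency (host HBA/NIC to the array and back)
gavg = kavg + davg     # latency the VM observes: GAVG = KAVG + DAVG
print(gavg)  # -> 3.0
```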
If everything is performing well then KAVG (that includes QAVG) should be < 0.5ms.
Device latency, displayed in Esxtop as DAVG – This is the latency for the storage IO to pass through the host FC HBA or iSCSI NIC, all the way to the storage array.
If everything is performing well DAVG should be < 1.5ms.
Below are two similar environments with the same number of VMs, Hosts, LUNs, and IOPS, and the same iSCSI SAN array and network. In the first example the Queue Depth is not choked, but in the second it is.
Say each VM pushes 10 IO requests simultaneously (so IO depth of 10), and you have 30 VMs per host, so you are pushing 300 IO requests simultaneously through the iSCSI NIC out to the SAN. Since the Queue Depth for the iSCSI initiator is 1024, it can easily accommodate the IO depth of 300. Now say these 30 VMs are spread over 10 Datastores / LUNs, so each Host is pushing 30 IO requests (IO from 3 VMs) to each virtual LUN device. Since the Queue Depth of each virtual LUN device is 128, there is no bottleneck here as well. So, queue depths across the storage stack work quite well.
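The arithmetic for this balanced environment, checked against the Queue Depths quoted earlier (1024 for the software iSCSI adapter, 128 per virtual LUN device), looks like this:

```python
vms_per_host = 30
io_depth_per_vm = 10
luns = 10

AQLEN = 1024   # software iSCSI adapter Queue Depth
DQLEN = 128    # per virtual LUN device Queue Depth

adapter_io_depth = vms_per_host * io_depth_per_vm   # 300 outstanding IOs at the adapter
per_lun_io_depth = adapter_io_depth // luns         # 30 outstanding IOs per LUN

# Neither the adapter nor any LUN device queue is choked:
print(adapter_io_depth <= AQLEN, per_lun_io_depth <= DQLEN)  # -> True True
```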
VMware was designed for a situation like this where the IO is well balanced across VMs, LUNs, and Hosts.
Say you have 30 VMs per host, 29 of which have an IO depth of 2 and one that has an IO depth of 242. The total IO depth for all VMs on the host is still 300, but now you face Queue Depth related performance bottlenecks at two levels – Guest VM and Device. The VM has a Queue Depth of 32, so you will need to increase the VM's Queue Depth by using the PVSCSI driver from VMware. This will let the entire workload with the IO depth of 242 go through the VM without any queuing.
You also want the high IOPS VM to be on its own LUN without sharing the LUN with other VMs. This prevents DSNRO from getting triggered, else the VM Queue Depth will get throttled further.
Now this workload with an IO depth of 242 encounters the virtual LUN device, whose Queue Depth (DQLEN), assigned by the software iSCSI initiator, is 128. So 114 IOs (242 – 128) will be queued for that virtual LUN device (QUED = 114). So even if you have a fast storage array and network, you will see VM latencies (GAVG) increase.
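The noisy neighbor arithmetic at each level of the stack can be sketched like this (255 is the PVSCSI Queue Depth recommended later in this post):

```python
def queued(io_depth: int, queue_depth: int) -> int:
    """IOs beyond a component's Queue Depth wait in a queue."""
    return max(0, io_depth - queue_depth)

noisy_vm_io_depth = 242

print(queued(noisy_vm_io_depth, 32))   # default VM Queue Depth: 210 IOs queued in the guest
print(queued(noisy_vm_io_depth, 255))  # PVSCSI raised to 255: no queuing at the VM
print(queued(noisy_vm_io_depth, 128))  # virtual LUN device (DQLEN 128): QUED = 114
```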
VirtuCache caches all reads and writes from VMware to an NVMe SSD or RAM in the ESXi host. NVMe SSDs and host RAM have very high Queue Depths (>2000). Since most reads are now served from host RAM or NVMe instead of the backend storage array, this eliminates queuing on the virtual LUN device for reads and frees up the virtual LUN device's Queue Depth to service writes.
VirtuCache caches all writes from VMware to the local NVMe SSD or RAM, and then continuously flushes the write cache to the backend array. So it doesn't reduce the volume of writes hitting the SAN array the way it does for reads. However, when VirtuCache flushes the writes to the backend array, it coalesces them, sending fewer, larger IOs to the array than VMware would have without VirtuCache. The amount of data written to the array remains the same, but it is transmitted in a smaller number of IOs (smaller IO depth), thus reducing the IOPS hitting the array, which reduces the load on the array controller and the queuing pressure on the virtual LUN device.
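To illustrate the coalescing effect, here is a toy sketch of my own (a simplification, not VirtuCache's actual algorithm): runs of consecutive dirty blocks in the write-back cache are merged into single larger IOs before being flushed to the array.

```python
def coalesce(block_addrs):
    """Merge runs of consecutive block addresses into (start, length) extents,
    so many small writes flush to the array as fewer, larger IOs."""
    extents = []
    for addr in sorted(set(block_addrs)):
        if extents and addr == extents[-1][0] + extents[-1][1]:
            # Contiguous with the previous extent: grow it by one block.
            extents[-1] = (extents[-1][0], extents[-1][1] + 1)
        else:
            extents.append((addr, 1))
    return extents

# Eight single-block writes flush as two IOs; same data, fewer IOPS to the array.
dirty = [100, 101, 102, 103, 200, 201, 202, 203]
print(coalesce(dirty))  # -> [(100, 4), (200, 4)]
```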
In addition to installing VirtuCache, make sure of the below configuration in VMware.
Install VMware’s PVSCSI driver in the VMs doing the high IOPS and increase the VM Queue Depth to 255.
Keep the high IO VMs on their own LUNs, else you will need to increase DSNRO from 32 to 255.
Ensure that VMware’s Storage IO Control (SIOC) feature which throttles down the IO from a noisy neighbor VM is disabled (it is off by default).
A high throughput workload can cause the virtual LUN device Queue Depth to fill up, in which case the VM latencies will increase, regardless of a fast array and low latency storage network.
VirtuCache will remedy the high VM latency issue caused by a full virtual LUN device queue depth, because most reads will now be served from in-host cache media and not from SAN array, thus eliminating queuing on the virtual LUN device.