How to get high IOPS at low queue depth in VMware?
Some applications, typically desktop and SMB CAD/CAM, manufacturing, and ERP software, require a large number of IOs per second (greater than 10,000) at a small queue depth (less than 8).
For a primer on how Queue Depths affect Latency and IOPS, review this post, which defines the various types of Queue Depths.
High IOPS at Low Queue Depth is Counter to Storage Component Design
The requirement of high IOPS at low Queue Depth runs counter to how storage components are built these days. Faster processors, whether in storage array controllers, RAID cards, or FC / iSCSI adapters, lend themselves to higher device queue depths, and higher queue depths in turn produce high IOPS. For instance, high-end NVMe flash drives are marketed as delivering a million or more IOPS, but that is at their full Queue Depth of around 2000. At a Queue Depth of only 4, the same NVMe flash drives might do only around 30K IOPS.
Equation Connecting Queue Depth to Latency and IOPS
Queue Depth = Latency (seconds) × IOPS.
In the above equation, Queue Depth and IOPS are the requirements specified by the application running inside a VM, and Latency (in seconds) is the VM-level latency.
It follows from this equation that at low queue depths, if you need even moderately high IOPS (5K-40K), the latency you must achieve on the storage IO path is so low that only locally installed NVMe or host RAM will do, and the storage IO must not traverse any network.
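To make the arithmetic easy to play with, here is a minimal Python sketch of the equation (this is Little's Law applied to outstanding IOs; the function name is mine, and the demo simply reuses the NVMe figures quoted above):

```python
# Little's Law applied to the storage path: Queue Depth = Latency (seconds) x IOPS,
# so the latency budget for a target workload is Queue Depth / IOPS.

def required_latency_ms(queue_depth: float, iops: float) -> float:
    """VM-level latency (ms) the IO path must achieve to deliver `iops` at `queue_depth`."""
    return queue_depth / iops * 1000.0

# The NVMe example above: the headline ~1M IOPS is only reached at a queue depth
# of ~2000, which implies about 2 ms spent per IO; the ~30K IOPS the same drive
# does at QD 4 implies roughly 0.13 ms per IO.
print(required_latency_ms(2000, 1_000_000))   # ~2.0 ms
print(required_latency_ms(4, 30_000))         # ~0.13 ms
```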
I will illustrate the above point with two use cases.
I am assuming standard application block sizes of 4 KB or 8 KB, the small block sizes typical of most applications. I will also focus on write IOPS because, at small block sizes, all media (including NVMe and RAM) perform slightly worse for writes than for reads.
The first example below illustrates the low-QD IOPS that a good NVMe drive is capable of. The second illustrates where that NVMe drive falls short and host RAM becomes the only option, which, surprisingly, happens when you simply halve the QD and double the IOPS.
Use Case 1 – 20K Write IOPS at Queue Depth of 8
Say you want 20K write IOPS at a QD of 8. From the equation above, this means a latency of 8 / 20,000 = 0.0004 seconds, i.e. 0.4 milliseconds. To achieve this latency, you need a locally attached high-end NVMe drive (or host RAM) serving the storage IO.
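The arithmetic, as a quick check (same equation, nothing new assumed):

```python
# Use case 1: 20K write IOPS at a queue depth of 8.
queue_depth, iops = 8, 20_000
latency_s = queue_depth / iops          # Latency = Queue Depth / IOPS
print(f"{latency_s * 1000:.1f} ms")     # -> 0.4 ms end-to-end VM-level latency budget
```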
Use Case 2 – 40K Write IOPS at Queue Depth of 4
Revving up the IOPS further and throttling down the Queue Depth, say you want 40K write IOPS at a QD of 4. Following from the equation above, you would need a storage IO path capable of 4 / 40,000 = 0.1 ms latency. This is lower than what NVMe flash drives are capable of; such latencies are only possible with host RAM.
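The same check for this case:

```python
# Use case 2: 40K write IOPS at a queue depth of 4.
queue_depth, iops = 4, 40_000
latency_s = queue_depth / iops          # Latency = Queue Depth / IOPS
print(f"{latency_s * 1000:.1f} ms")     # -> 0.1 ms, below what NVMe flash delivers at the VM level
```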
In both cases above, the storage IO must not traverse any network, Fibre Channel or Ethernet, because the network latency alone, even on the fastest networks, is more than 1 ms at the VM level. So even if you have the fastest storage array there is, the network latency, including the associated VMware multipathing and kernel latency, becomes the bottleneck. Hence if you need about 20K IOPS at a QD of 8 or less, you must have the storage IO serviced from locally attached NVMe SSDs, and 40K write IOPS at a QD of 4 requires host RAM.
The absolute top end of low-QD IOPS (say at QD = 4) is achieved with host RAM, and those numbers top out at about 40K write IOPS and 45K read IOPS.
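To see why the network alone rules these numbers out, here is a quick check that inverts the equation, using the roughly 1 ms VM-level network latency figure from the paragraph above (the post's figure, treated here as an assumed floor):

```python
# Assumed VM-level latency floor for any network path (FC or Ethernet),
# before the array itself does any work.
network_floor_ms = 1.0

for queue_depth in (4, 8):
    ceiling_iops = queue_depth / (network_floor_ms / 1000.0)
    print(f"QD={queue_depth}: at most {ceiling_iops:,.0f} IOPS over the network")

# QD=4 -> 4,000 IOPS, QD=8 -> 8,000 IOPS: far short of the 20K-40K targets above,
# no matter how fast the array behind the network is.
```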
Two Deployment Options – Datastore on Locally Attached NVMe OR Host-Side Cache to Host RAM / NVMe
If your application needs IOPS and QD in the range described in use case 1 above, then one option is to have your VMware Datastores reside on in-host NVMe. A second option is to have host-side caching software installed in the ESXi host that caches reads and writes from your existing SAN array to in-host NVMe.
For the IOPS and QD in use case 2, the options are a Datastore on a RAM disk (I am not aware of any production-grade RAM disk technology for ESXi) or host-side caching software that caches all storage IO from the SAN array to in-host RAM.
Pros and cons of both options are listed below.
| Datastore location | Pros | Cons |
| --- | --- | --- |
| Locally attached NVMe | Cheapest option. | Maxes out at about 0.4 ms VM-level latency, so 40K IOPS at QD = 4 is not possible. Does not support vSphere features like vMotion and HA. |
| Your existing SAN array, with caching software installed in the ESXi host that caches all storage IO to in-host RAM / NVMe | No change to your Datastore location; it stays on your current array. The host cache can cache to NVMe or RAM. Supports all VMware features like HA, vMotion, etc. | The host-side caching software costs extra. |
Summary
1. All conclusions in the post follow from this equation: Queue Depth = Latency (seconds) × IOPS.
2. If you need even moderately high IOPS (10K-40K) at Queue Depths of less than 8, then you need all storage IO in VMware serviced from locally attached media (NVMe or RAM), either with VMs residing in a Datastore on locally attached NVMe or with host-side caching software caching to in-host NVMe or host RAM.
3. You cannot have storage IO go over any network (FC, iSCSI, shared SAS, NFS, etc.), whether it is to a storage array or to a networked HCI box.