Performance Problems with a Large Number of Datastores in VMware
If you are experiencing storage performance issues and have more than 30 Datastores in an ESXi cluster, the Datastore count is most likely the cause, even if you have a high-performing all-flash array and a high-speed storage network.
First, we start with the basic definition of Queue Depth (described here), and specifically the various Queue Depths in VMware, described in the follow-on section at that link.
A large number of Datastores reduces the Queue Depth of each virtual storage array LUN mapped to the host (the DQLEN parameter in VMware), which in turn results in high latencies.
The figure below helps in understanding the component ‘funnel’ through which storage IO traverses in an ESXi environment.
Math Behind Why A Large Number of Datastores Results in High Latency
Say you have 50 Datastores mapped to a cluster/host, along with the highest-performing storage array, switches, and host-based Ethernet/FC adapters. The Queue Depth of the software iSCSI adapter (the AQLEN parameter in VMware) is 1024 (the Queue Depth of an FC adapter is about twice as much), so the iSCSI adapter and the underlying NICs can process no more than 1024 storage IO requests simultaneously.
With 50 Datastores mapped to the host, each virtual LUN device on the host (the device backing a Datastore, whose Queue Depth VMware reports as DQLEN) gets a Queue Depth of about 21 (1024 ÷ 50 ≈ 20.5, rounded up). This is roughly how the logic works, though VMware has a feature that adjusts the Queue Depth dynamically, to a small extent, based on throughput, so you would see it fluctuating between 0 and 32. But since it’s physically not possible to process more than 1024 requests simultaneously on that ESXi host’s iSCSI adapter, the adapter Queue Depth has to be divided more or less equally among the Datastores on the host.
Now, Queue Depth = Latency (seconds) × IOPS, or equivalently IOPS = Queue Depth × 1000 ÷ Latency (milliseconds). So if you want 5ms latencies, which is the standard expectation, the IOPS hitting that Datastore can’t exceed about 4K [IOPS = 21 × 1000 ÷ 5 = 4,200], even though the storage array might be capable of a million-plus IOPS. Also, since each virtual LUN’s Queue Depth in this case is only 21, that Datastore can process only 21 IO requests simultaneously; IO requests in excess of 21 get queued, which increases IO latency.
In this specific case, where an ESXi host is mapped to 50 iSCSI Datastores, each Datastore can sustain a maximum of about 4K IOPS, with 21 IO requests in flight, at 5ms latencies. So a large number of Datastores mapped to an ESXi host won’t be able to support a high-throughput (IOPS) workload. The problem gets worse as the number of Datastores increases.
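The arithmetic in this section can be sketched in a few lines of Python. This is only an illustration of the simplified model described here: it assumes the adapter Queue Depth is split equally across Datastores and ignores VMware's dynamic adjustment. The function name per_datastore_limits and its parameters are made up for this example and are not part of any VMware tool.

```python
def per_datastore_limits(adapter_queue_depth, datastore_count, target_latency_ms):
    """Return (queue depth per Datastore, max IOPS per Datastore at the target latency)."""
    if datastore_count <= 0 or target_latency_ms <= 0:
        raise ValueError("datastore_count and target_latency_ms must be positive")
    # Simplifying assumption from the text: the adapter queue is shared equally.
    queue_depth = adapter_queue_depth / datastore_count
    # Queue Depth = Latency (seconds) x IOPS, rearranged to solve for IOPS.
    max_iops = queue_depth / (target_latency_ms / 1000)
    return queue_depth, max_iops


if __name__ == "__main__":
    # 1024 is the software iSCSI adapter Queue Depth quoted in the text.
    for count in (10, 30, 50, 100):
        qd, iops = per_datastore_limits(1024, count, 5)
        print(f"{count:>3} Datastores: queue depth ~{qd:.1f}, max ~{iops:,.0f} IOPS at 5 ms")
```

For 50 Datastores this gives a Queue Depth of about 20.5 and a ceiling of roughly 4,100 IOPS, in line with the rounded figures of 21 and 4K used above. Doubling the Datastore count halves both numbers.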
How VirtuCache Fixes This Issue
With VirtuCache installed in an ESXi host and configured to cache to in-host SSD or RAM, a VM’s read operation will most likely be serviced from the in-host SSD/RAM instead of the backend storage array, and a VM’s write operation is always written to the in-host SSD/RAM. Most reads therefore bypass the SAN array. Cached writes are eventually synced (written) to the backend SAN array, but VirtuCache coalesces multiple write IOs into a single IO operation, which reduces the IOPS hitting the backend array. As a result, the storage array and network don’t play a role in storage latencies when VirtuCache is in the storage IO path; their performance characteristics are completely eclipsed by VirtuCache servicing reads and writes from in-host NVMe SSD or RAM.
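To get a rough feel for how much IO still reaches the backend array, here is a back-of-the-envelope model in Python. It is not VirtuCache's actual algorithm. The function backend_iops, the read hit ratio, and the number of writes merged per backend IO are all hypothetical inputs chosen for illustration; real values depend on the workload and cache size.

```python
def backend_iops(read_iops, write_iops, read_hit_ratio, writes_per_coalesced_io):
    """Estimate IOPS reaching the backend array behind a host-side cache.

    read_hit_ratio: fraction of reads served from the in-host cache (0.0 to 1.0).
    writes_per_coalesced_io: average number of VM writes merged into one backend write.
    Both are illustrative assumptions, not measured VirtuCache figures.
    """
    if not 0.0 <= read_hit_ratio <= 1.0:
        raise ValueError("read_hit_ratio must be between 0 and 1")
    if writes_per_coalesced_io < 1:
        raise ValueError("writes_per_coalesced_io must be at least 1")
    # Only cache misses go to the array for reads.
    read_misses = read_iops * (1.0 - read_hit_ratio)
    # Writes land in the cache first and are synced later in merged form.
    coalesced_writes = write_iops / writes_per_coalesced_io
    return read_misses + coalesced_writes


if __name__ == "__main__":
    # Hypothetical workload: 10,000 read IOPS and 5,000 write IOPS from the VMs.
    estimate = backend_iops(10_000, 5_000, read_hit_ratio=0.9, writes_per_coalesced_io=4)
    print(f"Estimated backend IOPS: {estimate:,.0f}")
```

With these made-up inputs, 15,000 IOPS issued by the VMs shrink to about 2,250 IOPS at the array, which is the kind of reduction that keeps each Datastore under its Queue Depth ceiling.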