VM latency remains unaffected when a storage array fails in a VMware metro storage cluster, provided all VM reads and writes are cached to VMware host SSD or RAM

Use Case:
Storage Performance

Location:
Connecticut, USA

Challenges:

  • Mashantucket Casinos had one HPE 3PAR array and 4 ESXi hosts in each of their two datacenters, a few hundred miles apart, with the two storage arrays and 8 hosts in a uniform VMware Metro / Stretched Cluster (VMSC) configuration. These datacenters needed to be connected by a high speed WAN link. The customer wanted to reduce the cost of their WAN link, without sacrificing the performance and reliability of their metro storage cluster.

Benefits:

  • VirtuCache was deployed with a 3.2TB Samsung PM1735 SSD [performance specs: 250K IOPS random writes / 1.5M IOPS random reads] in each of the 8 hosts across both datacenters. By caching all VM reads and writes to this local Samsung SSD, VirtuCache ensured that all writes and 95% of the reads were serviced from the local SSD at all times, regardless of whether the storage array at a site was operational or down (due to failure or maintenance activity).
  • Without VirtuCache, when a storage array in a VMSC is down, all storage IO goes over the WAN link. With VirtuCache in the IO path, storage traffic over the WAN link dropped to only 5% of what it was before, so the customer could go with a lower bandwidth (hence cheaper) WAN link. A link over the internet, which has higher peak latencies than a point-to-point link, also sufficed, since with VirtuCache in the IO path, peak WAN latencies as high as 200ms can be tolerated without affecting VM performance.
  • In summary, VirtuCache kept VM latencies below 5ms at all times, regardless of whether the storage array in the VMSC was operational or had failed, and the customer saved on an expensive WAN link.

The Virtunet Difference
The customer went with VirtuCache because:
  • VirtuCache allowed the customer to buy a cheaper 1gbps WAN link instead of the 10gbps WAN link that they would otherwise have needed (per HPE recommendation), saving the customer $115K in WAN link cost over 3 years. In comparison, perpetual licenses of VirtuCache with a 3.2TB Samsung NVMe SSD in each of their 8 hosts cost only $49K. So the WAN link savings more than made up for the cost of VirtuCache licenses for 8 hosts and 25.6TB of NVMe SSD capacity.
  • VirtuCache has other advantages. The customer's VMware infrastructure could now tolerate WAN link and iSCSI latency spikes of up to 200ms. It also reduced the load on the customer's storage network, WAN, and storage appliances. All this was because most of the storage IO was now serviced from in-host NVMe SSDs.
  • September 1, 2020; Connecticut, USA.
    WAN latency does not contribute to VM latencies when a storage array fails in a VMware metro / stretched cluster when all storage IO is cached to VMware host Flash or RAM.

    This post discusses how VirtuCache did this for Mashantucket Casinos, by explaining the read and write IO path in a VMware metro cluster before and after installing VirtuCache.

    WRITE IO PATH IN VMWARE METRO CLUSTER

    At the customer, ESXi hosts in both datacenter locations are connected to storage appliances at both sites. The storage network between ESXi hosts at one site and storage array on the other site goes over a 1gbps WAN link, and the storage network between ESXi hosts and storage array at the same site is over 10gbps LAN.

    Since the IO path from ESXi hosts to the storage appliance at the same site is shorter than the IO path to the appliance at the remote site, all reads and writes from VMs go to the array at the local site only. These paths are therefore called the active (or ALUA optimized) paths, and the paths from the hosts to the array at the remote site (which is separated by a WAN link) are the inactive (or ALUA non-optimized) paths. Storage IO goes over the inactive paths only if the storage array with the active paths fails.
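The path preference described above can be sketched in a few lines of Python. This is only an illustration of the selection rule (prefer an online array at the host's own site, otherwise fall back to the remote site); the `Path` class, `select_path` function, and site names are hypothetical and are not part of VMware's or VirtuCache's actual multipathing code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Path:
    array_site: str     # site where the target storage array lives (hypothetical label)
    array_online: bool  # whether that array is currently reachable


def select_path(host_site: str, paths: List[Path]) -> Path:
    """Pick the local (optimized) path if its array is online,
    otherwise fall back to a remote (non-optimized) path."""
    online = [p for p in paths if p.array_online]
    if not online:
        raise RuntimeError("no storage array reachable")
    # False sorts before True, so paths to the host's own site win.
    return min(online, key=lambda p: p.array_site != host_site)


paths = [Path("site1", True), Path("site2", True)]
print(select_path("site1", paths).array_site)  # local array while it is online

paths[0].array_online = False                  # simulate a local array failure
print(select_path("site1", paths).array_site)  # IO now crosses the WAN to the remote array
```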

    When the storage array in Datacenter / Site 1 fails, all VM reads and writes from that site go to the remote storage array over the WAN link, which increases VM read and write latencies.

    Write IO path in metro storage cluster with VirtuCache installed on every ESXi host

    Once VirtuCache was installed on all hosts at both datacenters, every VM write is written to SSD / RAM in the local ESXi host, and a second copy is written to SSD / RAM in another host in the same datacenter. This happens regardless of whether the local storage array is operational or has failed. In other words, VirtuCache sends a write acknowledgement back to VMware as soon as it commits the write to cache media in the ESXi hosts, without waiting for the write to be committed to the backend storage array.

    A VirtuCache background job continuously syncs the cached writes to the backend storage array, but this write flush process does not contribute to VM write latency. For this reason, neither local storage network latency nor inter-datacenter WAN latency contributes to VM write latency when VirtuCache is in the IO path.

    During regular operation, VirtuCache syncs the write cache to the local array. When the local storage array fails, it syncs the write cache to the remote array. Either way, VM write latencies remain the same. Since WAN latency does not factor into VM write latency, a lower bandwidth / higher latency link between the datacenters works just fine.
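The write path above is a write-back cache: the write is acknowledged once it is on local cache media and on a peer host, and a separate flush step pushes it to the array later. A minimal Python sketch of that general technique follows. It is not VirtuCache's implementation; the `WriteBackCache` class, its dictionaries standing in for SSD / RAM, and the `backend_write` callable are all assumptions made for illustration.

```python
from collections import deque


class WriteBackCache:
    """Toy write-back cache: acknowledge on cache commit, flush to the array later."""

    def __init__(self, backend_write):
        # Hypothetical callable(block, data) that writes to whichever
        # storage array (local or remote) is currently reachable.
        self.backend_write = backend_write
        self.local = {}       # stands in for SSD / RAM in the local host
        self.peer = {}        # stands in for the copy on another host at the same site
        self.dirty = deque()  # blocks written to cache but not yet flushed

    def write(self, block, data):
        self.local[block] = data
        self.peer[block] = data
        self.dirty.append(block)
        return "ack"  # returned without touching the array, so array/WAN latency is not on this path

    def flush(self):
        """Background step: drain dirty blocks to the array. Returns the number flushed."""
        flushed = 0
        while self.dirty:
            block = self.dirty.popleft()
            self.backend_write(block, self.local[block])
            flushed += 1
        return flushed


array = {}                                   # stands in for the backend storage array
cache = WriteBackCache(array.__setitem__)
print(cache.write(7, b"payload"), array)     # acknowledged while the array is still empty
print(cache.flush(), array)                  # the block reaches the array only on flush
```

In this sketch the latency the VM sees is bounded by `write`, while `flush` can be as slow as the link to the array allows without the VM noticing.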

    When VirtuCache is installed, VM write latencies stay the same whether the storage appliance at the local site is operational or not

    Read IO path in vMSC with VirtuCache installed on every ESXi host

    VirtuCache caches frequently and recently used reads to in-host cache media. Since most of the reads are serviced from in-host SSD / RAM, only a small number of reads go over the local storage network (during regular operation) or over the WAN link (when the local storage array fails). Since the volume of reads coming from the backend array is small, a lower bandwidth / higher latency link between the datacenters suffices.
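To make the bandwidth argument concrete, the short Python sketch below estimates how much read traffic still reaches the backend array for a given cache hit ratio, and whether it fits on a link of a given size. The 95% hit ratio comes from this case study; the 800 MB/s aggregate read rate is a made-up figure, the function names are hypothetical, and the estimate ignores write-flush traffic and protocol overhead.

```python
def backend_read_mb_per_s(total_read_mb_per_s: float, hit_ratio: float) -> float:
    """Read traffic that misses the in-host cache and must come from the array."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return total_read_mb_per_s * (1.0 - hit_ratio)


def fits_link(mb_per_s: float, link_gbps: float) -> bool:
    """True if the traffic fits within the link's raw bandwidth (1 Gbps = 125 MB/s)."""
    return mb_per_s <= link_gbps * 1000 / 8


TOTAL_READS_MB_PER_S = 800.0  # hypothetical aggregate VM read rate, not from the case study
HIT_RATIO = 0.95              # share of reads served from local SSD, per the case study

misses = backend_read_mb_per_s(TOTAL_READS_MB_PER_S, HIT_RATIO)
print(f"uncached: fits 1gbps link? {fits_link(TOTAL_READS_MB_PER_S, 1.0)}")
print(f"cached:   about {misses:.0f} MB/s of misses, fits 1gbps link? {fits_link(misses, 1.0)}")
```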

    CUSTOMER COST / BENEFIT

    With VirtuCache installed, the customer can now go with a lower bandwidth WAN link, stretch the cluster across longer distances, and tolerate WAN latency peaks of up to 200 milliseconds (much more than the 5-10 millisecond WAN latency that the storage vendor recommends), without adversely impacting VM read and write latencies.

    The customer had 4 hosts at each site and decided to buy a 1gbps WAN link to the internet instead of a 10gbps point-to-point link between datacenter locations.

    The table below lists the VirtuCache cost over 3 years and the savings the customer realized by buying a 1gbps internet WAN link instead of a more expensive 10gbps point-to-point link. As you can see, the cost savings more than make up for the money spent on VirtuCache.

    |                   | VirtuCache with a 3.2TB Samsung NVMe SSD deployed in each of the 8 hosts (perpetual license with 3-year support) | Cost savings because the customer decided to buy a 1gbps WAN link instead of a 10gbps point-to-point link |
    | Cost components   | $5K per host for VirtuCache and $1.1K per host for the 3.2TB Samsung NVMe SSD | Difference in cost between 10gbps point-to-point WAN link and 1gbps internet WAN link = $3.2K/month |
    | Cost over 3 years | 8 VirtuCache licenses + 25.6TB NVMe SSD capacity = $49K | WAN cost savings over 3 years = $115K |
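The totals in the table follow directly from the per-host and per-month figures. The short Python check below reproduces them; the variable names are just labels chosen for this example, and the net figure is derived here rather than quoted from the customer.

```python
HOSTS = 8
LICENSE_PER_HOST = 5_000      # VirtuCache perpetual license with 3-year support, USD
SSD_PER_HOST = 1_100          # 3.2TB Samsung NVMe SSD, USD
WAN_SAVING_PER_MONTH = 3_200  # 10gbps point-to-point vs 1gbps internet link, USD
MONTHS = 36                   # 3 years

virtucache_cost = HOSTS * (LICENSE_PER_HOST + SSD_PER_HOST)
wan_savings = WAN_SAVING_PER_MONTH * MONTHS
net_savings = wan_savings - virtucache_cost

print(f"VirtuCache cost over 3 years: ${virtucache_cost:,}")  # $48,800, rounded to $49K in the table
print(f"WAN savings over 3 years:     ${wan_savings:,}")      # $115,200, rounded to $115K in the table
print(f"Net savings over 3 years:     ${net_savings:,}")      # $66,400
```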