Why are Write Latencies High Even When the EMC Unity Appliance has Plenty of System Cache Memory and Fast Cache SSDs?
If you are experiencing high write latencies in VMs when using an EMC Unity Hybrid array, it could be because the caching and tiering solutions in Unity don’t cache random small block writes, and your applications might be generating large volumes of such storage IO.
EMC Unity has three controller based caching and tiering features – System Cache (caching to RAM in the array), Fast Cache (caching to SSD in the array) and Fast VP (tiering to SSD in the array). From reviewing EMC’s documentation at this link, I have concluded that:
Fast Cache caches only reads and not writes.1
System Cache caches reads and writes to Unity RAM, but it doesn’t work for random IO or small block size IO.2
Fast VP is a tiering solution in Unity, and as with any storage appliance based tiering solution, Fast VP only accelerates reads.3
So, EMC Unity will not cache small block (under 64KB) random writes.
Traditional IT workloads that run in Windows VMs use small block sizes (typically under 16KB). If your workload also happens to be write intensive, then EMC Unity might show high latencies, despite your Unity appliance having large amounts of System Cache RAM, and Fast Cache or Fast VP SSDs.
How does VirtuCache fix this issue with random writes?
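The conclusions in this section can be written down as a small lookup. The Python sketch below only encodes this article's reading of the EMC white paper; it does not describe how Unity is implemented. The function name `unity_features_that_help` is hypothetical, and treating a block of exactly 64KB as "large" is an assumption made for the example.

```python
# Threshold used in this article; how exactly 64KB is treated is an assumption.
SMALL_BLOCK_LIMIT_KB = 64


def unity_features_that_help(op, block_kb, is_random):
    """Return the Unity features that accelerate an IO, per this article's conclusions."""
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    features = []
    # System Cache: reads and writes, but only sequential, large block IO.
    if not is_random and block_kb >= SMALL_BLOCK_LIMIT_KB:
        features.append("System Cache")
    # Fast Cache and Fast VP: reads only, according to this article.
    if op == "read":
        features.append("Fast Cache")
        features.append("Fast VP")
    return features


# An 8KB random write, typical of a Windows VM, matches no feature.
print(unity_features_that_help("write", 8, True))
# A 256KB sequential write is helped by System Cache only.
print(unity_features_that_help("write", 256, False))
```

Under these rules, the first call returns an empty list, which is the gap this article describes.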
VirtuCache caches all frequently and recently accessed random reads, and all random and sequential writes, to Flash or RAM installed in the VMware host. VirtuCache also caches data in exactly the block size that the applications read and write, so it works equally well for large or small block IO. The only IO that VirtuCache doesn’t cache is sequential reads, because blocks in a sequential read are typically read once and most likely not read again for a long time, so there is no point in caching them.
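The caching policy described above can be summarized in a few lines. This is a minimal sketch of the policy as stated in this article, not VirtuCache's actual code; the function name `virtucache_eligible` is hypothetical, and for random reads it reports only eligibility, since the article says that only frequently and recently accessed random reads are kept in cache.

```python
def virtucache_eligible(op, is_random):
    """Return True if an IO is a caching candidate under the policy described above.

    Block size is deliberately not a parameter: the article states that data is
    cached in whatever block size the application uses.
    """
    if op == "write":
        # All writes are cached, random or sequential.
        return True
    if op == "read":
        # Random reads are candidates; sequential reads bypass the cache.
        return is_random
    raise ValueError("op must be 'read' or 'write'")


for op in ("read", "write"):
    for is_random in (True, False):
        pattern = "random" if is_random else "sequential"
        print(f"{pattern} {op}: {virtucache_eligible(op, is_random)}")
```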
Why is VirtuCache able to cache small block random writes but not EMC Unity?
Of all the types of storage IO that stress storage systems, the worst kind is highly random, small block size, write IO. Firstly, because the block size is small, a VM can emit large amounts of small block write IO quickly (compared to large block size IO), and with higher VM density, the total small block write IOPS hitting the appliance could be huge. Secondly, whether the block is 1MB or 4KB, the same number of CPU cycles (on the storage appliance controller's CPU) are used to process the block. If the IO is also random, storage CPU usage gets worse, since a large metadata index needs to be scanned to read or write random blocks. As a result, a lot of CPU cycles are required to process random small block write IO.
VirtuCache uses ESXi host CPUs for caching operations, and not storage appliance CPUs. Each ESXi host typically has two or more CPUs, with each CPU typically having 16 or more cores. Multiply this by the number of hosts that VirtuCache is installed on, and VirtuCache has access to a large amount of CPU. This is not the case with a storage appliance, which typically has two controllers, each with one or two processors of 4 or 8 cores. As a result, storage appliance caching has access to far fewer CPU cores.
So small block random write IO is better handled with VirtuCache than with any storage appliance caching functionality, since VirtuCache has access to a lot more CPU than any storage appliance. This argument also holds true for mid-market all-flash arrays.
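To make the core count comparison concrete, here is a rough back-of-the-envelope calculation in Python. The per-host and per-controller figures are the typical values quoted above; the cluster size of 4 hosts is a hypothetical number chosen only for illustration. Host cores are shared with the VMs themselves, so this compares installed cores, not cores dedicated to caching.

```python
def total_cores(units, cpus_per_unit, cores_per_cpu):
    """Total installed CPU cores across identical units (hosts or controllers)."""
    return units * cpus_per_unit * cores_per_cpu


# Hypothetical 4-host ESXi cluster: 2 CPUs per host, 16 cores per CPU.
host_cores = total_cores(units=4, cpus_per_unit=2, cores_per_cpu=16)

# Storage appliance at the upper end of the range above:
# 2 controllers, 2 processors per controller, 8 cores per processor.
appliance_cores = total_cores(units=2, cpus_per_unit=2, cores_per_cpu=8)

print(f"ESXi host cores:  {host_cores}")
print(f"Appliance cores:  {appliance_cores}")
print(f"Ratio:            {host_cores / appliance_cores:.1f}x")
```

With these inputs the cluster has 128 installed cores against 32 in the appliance, and the gap widens with every host added.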
Other advantages of using VirtuCache versus storage appliance based caching.
Cache media closer to the CPU with VirtuCache – In the case of VirtuCache, the cache media is right on the motherboard of the VMware host CPU that consumes hot data, and connected to the host CPU over a dedicated PCIe/NVME bus (if using PCIe / NVME SSD to cache) or memory bus (if using host RAM as cache media). In the case of storage appliances, by contrast, the SSDs sit behind the shared storage network and the storage controllers.
Higher performing than any appliance, if you use host RAM with VirtuCache – This is because RAM is the fastest storage media there is. We will be higher performing than any storage appliance in this case, because storage appliances are not designed to use large amounts of RAM in the storage IO path.
In the case of EMC Unity, SAS SSDs are used with Fast Cache.4 With VirtuCache you can use higher performing NVME SSDs and even higher performing host RAM as cache media.
Lower Cost – VirtuCache costs $3000 per host for a perpetual license. An enterprise grade NVME / PCIe SSD, like the 2TB Intel P4600, that does a whopping 600K IOPS random small block reads and 200K IOPS random small block writes, costs $1200 (2019 prices). As a result, VirtuCache will be higher performing and lower cost than any controller based caching or tiering solution in any appliance, including EMC Unity.
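The cost figures above can be turned into a quick per-cluster estimate. The sketch below uses only the prices and IOPS quoted in this article (2019 prices) and assumes one cache SSD per host, which is an assumption for the example; the function names and the 4-host cluster are hypothetical.

```python
LICENSE_PER_HOST_USD = 3000      # perpetual VirtuCache license, per the article
SSD_PER_HOST_USD = 1200          # 2TB Intel P4600, 2019 price per the article
SSD_RANDOM_READ_IOPS = 600_000   # small block random reads, per the article
SSD_RANDOM_WRITE_IOPS = 200_000  # small block random writes, per the article


def cluster_cost_usd(hosts):
    """Total cost for a cluster, assuming one license and one SSD per host."""
    return hosts * (LICENSE_PER_HOST_USD + SSD_PER_HOST_USD)


def cluster_write_iops(hosts):
    """Aggregate rated random write IOPS of the cache SSDs, one SSD per host."""
    return hosts * SSD_RANDOM_WRITE_IOPS


hosts = 4  # hypothetical cluster size
print(f"Cost for {hosts} hosts: ${cluster_cost_usd(hosts):,}")
print(f"Rated cache write IOPS: {cluster_write_iops(hosts):,}")
```

These are rated device figures, so the IOPS actually seen by VMs will depend on the workload and the host.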
Cross references to EMC white paper on System Cache, Fast Cache, and Fast VP.
1 – Page 17 says this regarding Fast Cache – ‘The system copies the highly accessed 64 KB chunks of data from their current locations on spinning drive to FAST Cache.’ This means that Fast Cache caches only reads and not writes.
2 – Page 5 says – ‘System Cache (DRAM Cache) – EMC Unity software component which leverages DRAM memory to improve host read and write performance.’ So, System Cache does cache writes. However, the table on Page 31 (copied below) says it is best suited for sequential IO and large block (> 64KB block size) IO. This means that System Cache is ill suited to cache random small block IO.
3 – Page 4 says that ‘FAST VP analyzes this data and makes decisions to move data across multiple tiers in a Pool’. This means that Fast VP tiers only reads to Flash, and not writes.
4 – Search for ‘SAS FLASH 2’ on page 30.