How To Thoroughly Vet Server Side Caching Software For LUN Corruption Issues?
Server Side Caching improves storage throughput in the cheapest possible way compared to alternatives like upgrading to All Flash Arrays or upgrading the Storage Network. By using $1/GB SATA SSDs or $3/GB SAS SSDs, storage IOPS to each server can be accelerated to tens of thousands of IOPS and latencies reduced to single digit milliseconds. This easily validates the usefulness and cost effectiveness of Server Side Caching.
One of the more challenging aspects for Server Side Caching vendors is successfully interoperating with the various components in the Storage IO path – from the application in the VM all the way to the Storage Appliance and back. Server Side Caching software from any vendor modifies the I/O path. Consequently, any bug in the vendor software can corrupt your entire LUN or shut down I/O, even if the software caches only reads. Hence it is highly recommended to thoroughly evaluate this software, especially in areas where the IO pattern and flow change drastically once the software is in the IO path.
Below are some pointers to the areas in VMware that are most vulnerable to the IO path and IO pattern changes introduced by Server Side Caching software. As the prospect, you should evaluate each of these areas closely for potential data corruption issues:
- Run Database Driven Tests and not just synthetic Iometer type tests. Iometer or FIO type block level or filesystem level testing is not enough. Database driven tests like TPC-C and TPC-H are some of the best ways to test Server Side Caching and ensure that the LUN does not get corrupted. TPC-C is especially useful: it mimics an OLTP workload and is heavy on writes, so it tests the performance benefits of Server Side Caching and is also a great way to bubble up any corruption issues.
- SQLIOSIM – This is a tool that simulates MS SQL Server workload. It does a good job of exposing data consistency issues with Server Side Caching software. The benefit of using SQLIOSIM versus running a TPC-C or TPC-H test is that SQLIOSIM doesn’t need an MS SQL Server Database installed.
- FC, FCoE, and iSCSI – Most vendors support iSCSI and FC protocols. If you are using FCoE, ensure that your vendor supports it. Also test Software iSCSI and FCoE initiators. Many Server Side Caching vendors use a different I/O path for software based initiators than they do for hardware based ones.
- Multipathing Plug-ins – One could use the Native Multi-Pathing Plug-in within VMware or storage vendor specific plug-ins like PowerPath supplied by EMC. Interoperability with Multi-Pathing Plug-ins is probably the single most important aspect of testing Server Side Caching. Test Active-Active, Active-Passive, and Path Failover at the very least.
- VAAI – Some storage vendors are still in the process of implementing VAAI. As they roll out VAAI on a piecemeal basis, ensure that the Server Side Caching software interoperates with each newer piece of VAAI functionality rolled out by your storage vendor.
- Ensure interoperability with any other Kernel Mode software. Some examples are McAfee’s or Trend Micro’s security solutions and Zerto DR software.
- Virtual Storage Appliances – Many times Virtual Storage Appliances deploy their own filesystems, and these most likely will interfere with Server Side Caching. A good example is that VMware’s own vFRC does not interoperate with vSAN.
- VMware DRS – Test with Automatic DRS, especially in Aggressive mode, where VMs can migrate between servers rapidly. It is worth seeing how the cache keeps up with such rapid VM migrations.
- Test Snapshot creation and recovery.
- Test Linked Clones – Some Server Side Caching vendors work better with Linked Clones than others. This is because some vendors understand that the parent needs to be cached only once, and not repeatedly for every cloned VM. This is obviously the better approach.
- Support for Clustering Software – Most vendors should be able to support VMware Clustering. However, if you use other clustering software like Microsoft Cluster Service (MSCS), Veritas Cluster Server or Oracle RAC, then chances are very high that the Server Side Caching software does not support it.
- Timing Issues in the SAN – Since in-Server SSDs are 10X+ faster than most older SANs, I have noticed that most Server Side Caching software works fine with these older SANs. When you have a storage appliance with peak performance close to the throughput and latency of in-server SSDs, the chances of LUN corruption increase many times.
- SSD Failure – Server Side Caching software should not corrupt data when the local SSD fails. Unplug the SSD when Caching is turned on and see what happens.
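Alongside the database driven tests above, a simple stamped-block check can help surface misplaced or stale data. The following is a minimal Python sketch, not any vendor's tool: each block carries its own index, a generation number and a CRC32 of its payload, so a read-back pass can tell apart a corrupted payload, a block that landed at the wrong offset, and stale data from an earlier write. The function names, the 4 KB block size and the header layout are all assumptions made for illustration.

```python
import os
import struct
import zlib

BLOCK_SIZE = 4096                    # assumed block size for this sketch
HEADER = struct.Struct("<QQI")       # block index, generation, CRC32 of payload


def make_block(index, generation):
    """Build one block: header stamp followed by a random payload."""
    payload = os.urandom(BLOCK_SIZE - HEADER.size)
    return HEADER.pack(index, generation, zlib.crc32(payload)) + payload


def write_blocks(path, count, generation):
    """Write `count` stamped blocks to `path` and force them to stable storage."""
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        for i in range(count):
            f.seek(i * BLOCK_SIZE)
            f.write(make_block(i, generation))
        f.flush()
        os.fsync(f.fileno())


def verify_blocks(path, count, generation):
    """Return a list of (block index, problem) tuples; empty means all blocks check out."""
    bad = []
    with open(path, "rb") as f:
        for i in range(count):
            f.seek(i * BLOCK_SIZE)
            block = f.read(BLOCK_SIZE)
            if len(block) != BLOCK_SIZE:
                bad.append((i, "short read"))
                continue
            index, gen, crc = HEADER.unpack_from(block)
            if index != i:
                bad.append((i, "misplaced block"))
            elif gen != generation:
                bad.append((i, "stale data"))
            elif zlib.crc32(block[HEADER.size:]) != crc:
                bad.append((i, "payload corrupted"))
    return bad
```

A typical run would call write_blocks on a file that lives on the cached LUN, trigger the event under test (path failover, SSD pull, vMotion), and then call verify_blocks with the same generation. Keep in mind that the guest OS page cache can satisfy the reads, so it is safer to verify after a reboot or from another VM so that the reads actually travel through the caching layer.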
My point with this article is that you should not simply rely on ‘Before’ and ‘After’ testing of Server Side Caching with synthetic testing tools like Iometer. Almost all the Server Side Caching vendors will look much better than storage appliance vendors in such tests. It will take no more than 30 minutes for a Server Side Caching software vendor to wow prospects with ‘Before’ / ‘After’ results using Iometer.
Please go through a rigorous evaluation of the Server Side Caching software to ensure that the shared LUNs backing a VMware cluster are not corrupted while deploying such software.
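One way to make checks around events like snapshot recovery, path failover or an SSD pull repeatable is to record file checksums before the event and compare them afterwards. A minimal Python sketch follows. build_manifest and diff_manifests are hypothetical helper names, and the approach assumes the test data is quiescent (not being rewritten by an application) between the two passes.

```python
import hashlib
import os


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as f:
                # Read in 1 MiB chunks so large files do not need to fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(full, root)] = digest.hexdigest()
    return manifest


def diff_manifests(before, after):
    """Report files that disappeared, appeared, or changed content."""
    return {
        "missing": sorted(set(before) - set(after)),
        "added": sorted(set(after) - set(before)),
        "changed": sorted(k for k in before.keys() & after.keys()
                          if before[k] != after[k]),
    }
```

Build the manifest once before the event, again afterwards, and treat any non-empty entry in the diff as something to investigate before the caching software goes anywhere near production.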