Paths get disabled when VirtuCache fronts CEPH
When VirtuCache is accelerating Ceph (over iSCSI) and vCenter raises the alert
“Lost storage path redundancy”
go to each datastore and check whether the number of active paths is lower than it should be.
1) Log in to the affected ESXi host and check the vmkernel logs
dmesg | grep "with status Timeout"
2017-12-06T22:03:50.689Z cpu6:103634645)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:656: Path "vmhba36:C7:T1:L0" (UP) command 0xa3 failed with status Timeout. H:0x5 D:0x0 P:0x0 Possible sense data: 0xd 0x0 0x0.
If you see many log lines with the above signature, you have hit this condition: ESXi does not mark the path down (and fail over to the other paths) because the underlying iSCSI transport connection is still up, yet all or most I/O commands on that path are failing because the array is not servicing them.
NB: If there are only a few intermittent log lines of this kind, this condition may not apply.
Identify the affected paths from the log. In the log line above, for instance, the affected path is “vmhba36:C7:T1:L0”.
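To tell "a lot of log lines" apart from "a few intermittent" ones, it can help to count the timeout lines per path. The following is a minimal sketch, not part of VirtuCache or ESXi: the function name count_timeouts_by_path is hypothetical, and it assumes the shell on the host provides grep -o, sort, uniq, and awk.

```shell
#!/bin/sh
# Hypothetical helper: summarise SATP timeout log lines per path.
# Reads vmkernel log text on stdin and prints "<count> <path>",
# with the most frequently failing path first.
count_timeouts_by_path() {
    grep "with status Timeout" \
        | grep -oE 'vmhba[0-9]+:C[0-9]+:T[0-9]+:L[0-9]+' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Possible usage on the affected host:
#   dmesg | count_timeouts_by_path
```

A path with a large count is a candidate for the condition described above; a path with only one or two hits is more likely to be an intermittent timeout.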
2) Find the “Target Portal Tag” of the iSCSI gateway (Ceph server) corresponding to these paths
esxcli storage core path list
iqn.1998-01.com.vmware:kvh-vmwhost1-60a76cb1-00023d000008,iqn.2003-01.org.linux-iscsi.igw.x86:cephvol1,t,2-naa.60014055ad287fa4b78303c97b764343
UID: iqn.1998-01.com.vmware:kvh-vmwhost1-60a76cb1-00023d000008,iqn.2003-01.org.linux-iscsi.igw.x86:cephvol1,t,2-naa.60014055ad287fa4b78303c97b764343
Runtime Name: vmhba36:C7:T1:L0
Device: naa.60014055ad287fa4b78303c97b764343
Device Display Name: SUSE iSCSI Disk (naa.60014055ad287fa4b78303c97b764343)
Adapter: vmhba36
Channel: 7
Target: 1
LUN: 0
Plugin: NMP
State: active
Transport: iscsi
Adapter Identifier: iqn.1998-01.com.vmware:kvh-vmwhost1-60a76cb1
Target Identifier: 00023d000008,iqn.2003-01.org.linux-iscsi.igw.x86:cephvol1,t,2
Adapter Transport Details: iqn.1998-01.com.vmware:kvh-vmwhost1-60a76cb1
Target Transport Details: IQN=iqn.2003-01.org.linux-iscsi.igw.x86:cephvol1 Alias= Session=00023d000008 PortalTag=2
Maximum IO Size: 131072
esxcli storage core path list | grep -A 15 "vmhba36:C7:T1:L0" | grep "PortalTag="
Target Transport Details: IQN=iqn.2003-01.org.linux-iscsi.igw.x86:cephvol1 Alias= Session=00023d000008 PortalTag=2
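The grep -A 15 filter depends on how many lines separate the runtime name from the transport details. As an alternative sketch, the portal tag can be looked up by tracking the "Runtime Name" field with awk. The function name portal_tag_for_path is hypothetical, and the field labels are assumed to match the output shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the PortalTag of one path.
# Reads `esxcli storage core path list` output on stdin;
# $1 is the runtime name of the path, e.g. vmhba36:C7:T1:L0.
portal_tag_for_path() {
    awk -v want="$1" '
        # Remember which path the following fields belong to.
        /Runtime Name:/ { current = $NF }
        # Only the target-side details carry the PortalTag value.
        /Target Transport Details:/ && current == want {
            if (match($0, /PortalTag=[0-9]+/)) {
                # Skip the 10 characters of "PortalTag=" and keep the digits.
                print substr($0, RSTART + 10, RLENGTH - 10)
            }
        }
    '
}

# Possible usage on the affected host:
#   esxcli storage core path list | portal_tag_for_path vmhba36:C7:T1:L0
```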
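3) Log in to any of the Ceph servers and issue the following command to determine the server/iSCSI gateway corresponding to the affected portal tag/group. In the example output below, PortalTag=2 for target cephvol1 corresponds to tpg2, whose portal is 192.168.4.51:3260, so the affected gateway is the server with IP address 192.168.4.51.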
sudo targetcli ls
o- / ............................................................................... [...]
  o- backstores .................................................................... [...]
  | o- fileio ......................................................... [0 Storage Object]
  | o- iblock ......................................................... [0 Storage Object]
  | o- pscsi .......................................................... [0 Storage Object]
  | o- rbd ........................................................... [2 Storage Objects]
  | | o- rbd-cephvol1 .................................. [/dev/rbd/rbd/cephvol1 activated]
  | | o- rbd-cephvol2 .................................. [/dev/rbd/rbd/cephvol2 activated]
  | o- rd_mcp ......................................................... [0 Storage Object]
  o- ib_srpt ................................................................. [0 Targets]
  o- iscsi ................................................................... [2 Targets]
  | o- iqn.2003-01.org.linux-iscsi.igw.x86:cephvol1 ............................. [3 TPGs]
  | | o- tpg1 ................................................................. [disabled]
  | | | o- acls ................................................................. [0 ACLs]
  | | | o- luns .................................................................. [1 LUN]
  | | | | o- lun0 ............................. [rbd/rbd-cephvol1 (/dev/rbd/rbd/cephvol1)]
  | | | o- portals ............................................................ [1 Portal]
  | | |   o- 192.168.4.41:3260 ....................................... [OK, iser disabled]
  | | o- tpg2 .................................................................. [enabled]
  | | | o- acls ................................................................. [0 ACLs]
  | | | o- luns .................................................................. [1 LUN]
  | | | | o- lun0 ............................. [rbd/rbd-cephvol1 (/dev/rbd/rbd/cephvol1)]
  | | | o- portals ............................................................ [1 Portal]
  | | |   o- 192.168.4.51:3260 ....................................... [OK, iser disabled]
  | | o- tpg3 ................................................................. [disabled]
  | |   o- acls ................................................................. [0 ACLs]
  | |   o- luns .................................................................. [1 LUN]
  | |   | o- lun0 ............................. [rbd/rbd-cephvol1 (/dev/rbd/rbd/cephvol1)]
  | |   o- portals ............................................................ [1 Portal]
  | |     o- 192.168.4.61:3260 ....................................... [OK, iser disabled]
  | o- iqn.2003-01.org.linux-iscsi.igw.x86:cephvol2 ............................. [3 TPGs]
  |   o- tpg1 ................................................................. [disabled]
  |   | o- acls ................................................................. [0 ACLs]
  |   | o- luns .................................................................. [1 LUN]
  |   | | o- lun0 ............................. [rbd/rbd-cephvol2 (/dev/rbd/rbd/cephvol2)]
  |   | o- portals ............................................................ [1 Portal]
  |   |   o- 192.168.4.41:3260 ....................................... [OK, iser disabled]
  |   o- tpg2 .................................................................. [enabled]
  |   | o- acls ................................................................. [0 ACLs]
  |   | o- luns .................................................................. [1 LUN]
  |   | | o- lun0 ............................. [rbd/rbd-cephvol2 (/dev/rbd/rbd/cephvol2)]
  |   | o- portals ............................................................ [1 Portal]
  |   |   o- 192.168.4.51:3260 ....................................... [OK, iser disabled]
  |   o- tpg3 ................................................................. [disabled]
  |     o- acls ................................................................. [0 ACLs]
  |     o- luns .................................................................. [1 LUN]
  |     | o- lun0 ............................. [rbd/rbd-cephvol2 (/dev/rbd/rbd/cephvol2)]
  |     o- portals ............................................................ [1 Portal]
  |       o- 192.168.4.61:3260 ....................................... [OK, iser disabled]
  o- loopback ................................................................ [0 Targets]
  o- qla2xxx ................................................................. [0 Targets]
  o- tcm_fc .................................................................. [0 Targets]
  o- vhost ................................................................... [0 Targets]
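Reading the portal address off the tree by eye is error-prone when there are many targets. The lookup in step 3 can be sketched with awk as follows. The function name portal_for_tag is hypothetical, and the sketch assumes that target lines carry a TPG count in their bracketed summary (as in the output above) and that the portal tag N maps to the node named tpgN.

```shell
#!/bin/sh
# Hypothetical helper: print the portal address(es) of one TPG.
# Reads `targetcli ls` output on stdin.
# $1 = target IQN, $2 = portal tag number (e.g. 2 for tpg2).
portal_for_tag() {
    awk -v iqn="$1" -v tpg="tpg$2" '
        {
            # The node name is the word after the "o-" marker,
            # whatever tree indentation precedes it.
            name = ""
            for (i = 1; i < NF; i++) {
                if ($i == "o-") { name = $(i + 1); break }
            }
        }
        # A target line: an IQN whose summary mentions TPGs.
        name ~ /^iqn\./ && /TPG/ { in_target = (name == iqn); in_tpg = 0; next }
        # A TPG line inside the wanted target.
        in_target && name ~ /^tpg[0-9]+$/ { in_tpg = (name == tpg); next }
        # A portal line (IPv4 address and port) inside the wanted TPG.
        in_target && in_tpg && name ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+$/ { print name }
    '
}

# Possible usage on a Ceph server:
#   sudo targetcli ls | portal_for_tag iqn.2003-01.org.linux-iscsi.igw.x86:cephvol1 2
```

The sketch only handles IPv4 portals; the IP address it prints identifies the gateway to log in to in the next step.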
4) Log in to the affected Ceph server/iSCSI gateway and reboot it
NB: If several servers are affected, reboot only one server at a time, and proceed to the next server only after confirming that the cluster health is OK (by issuing the “ceph status” command).
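When several gateways need a reboot, the health check between reboots can be scripted. The following is a minimal sketch under stated assumptions: the function name wait_for_health_ok and its default retry values are hypothetical, and it polls "ceph health" (which prints a single line beginning with HEALTH_OK, HEALTH_WARN, or HEALTH_ERR) instead of parsing the longer "ceph status" output.

```shell
#!/bin/sh
# Hypothetical helper: block until the Ceph cluster reports HEALTH_OK.
# $1 = maximum number of checks (default 60)
# $2 = seconds to sleep between checks (default 10)
# Returns 0 once the cluster is healthy, 1 if it never became healthy.
wait_for_health_ok() {
    attempts="${1:-60}"
    interval="${2:-10}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        case "$(ceph health 2>/dev/null)" in
            HEALTH_OK*) return 0 ;;
        esac
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}

# Possible usage after rebooting one gateway, before touching the next:
#   wait_for_health_ok 60 10 || echo "cluster not healthy, do not reboot the next server"
```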