Clustered file systems are powerful, but they must be implemented carefully to avoid split brains, which are very likely to lead to data corruption. A very effective way to cope with this risk is SCSI fencing: it denies access to the shared disks to nodes that the majority of the cluster nodes consider failed. The only requirement for SCSI fencing is that the shared storage supports SPC-3 Persistent Reservations. This post covers the topic and explains how to configure a stonith device that exploits SCSI fencing.
What is SCSI fencing
SCSI persistent reservations provide the capability to control the access of each node to shared storage devices. Roughly put, a SCSI reservation consists of registering a key and booking that key.
There are different kinds of reservations that provide features such as write exclusivity: for example, from a specific machine, you can register a key on a SCSI LUN and then get a write exclusive reservation for that key. SCSI persistent reservations can be exploited to achieve storage fencing on High Availability clusters: this means that it is possible to revoke access to shared storage devices from cluster members that are considered faulty. Pacemaker, for example, employs SCSI persistent reservations through the fence_scsi agent. Of course, the requirement is that the storage supports SCSI persistent reservations.
Registrations
A registration occurs when a node registers a unique key with a device: this means that a device has many registrations, at least one for every accessing node. In a multipath environment the same node registers on a device once for each path, always with the same key. Keys are 64-bit values, usually written as hexadecimal numbers: for example 0xDEADBEEF or 0x123ABC.
Reservations
A reservation dictates how a device can be accessed. Unlike registrations, there can be only one reservation on a device at any time. The node that holds the reservation is known as the "reservation holder". The reservation defines how other nodes may access the device.
Pacemaker uses a "Write Exclusive, Registrants Only" reservation: this type of reservation dictates that only nodes that have registered with that device may write to the device.
Fencing
Pacemaker is able to perform fencing of a failed node using SCSI Persistent Reservations (PR) by simply removing a node's registration key from all devices the node is connected to. This prevents the failed node from being able to write to those devices again.
Validating SPC-3 PR support
As you can easily guess, the very first thing to do is to ensure that the connected SCSI storage actually supports SPC-3 Persistent Reservations. The sg3_utils RPM package provides, among other tools, sg_persist.
Install it as follows:
yum install -y sg3_utils
sg_persist can be used to perform the check: for example, to test /dev/sda, issue:
sg_persist --in --report-capabilities -v /dev/sda
If the output looks like the following, then SPC-3 Persistent Reservations are supported:
inquiry cdb: 12 00 00 00 24 00
LIO-ORG www_0_static 4.0
Peripheral device type: disk
Persistent Reservation In cmd: 5e 02 00 00 00 00 00 20 00 00
Report capabilities response:
Compatible Reservation Handling(CRH): 1
Specify Initiator Ports Capable(SIP_C): 1
All Target Ports Capable(ATP_C): 1
Persist Through Power Loss Capable(PTPL_C): 1
Type Mask Valid(TMV): 1
Allow Commands: 1
Persist Through Power Loss Active(PTPL_A): 0
Support indicated in Type mask:
Write Exclusive, all registrants: 1
Exclusive Access, registrants only: 1
Write Exclusive, registrants only: 1
Exclusive Access: 1
Write Exclusive: 1
Exclusive Access, all registrants: 1
Otherwise, if it looks like the following, SPC-3 Persistent Reservations are not supported:
inquiry cdb: 12 00 00 00 24 00
VBOX HARDDISK 1.0
Peripheral device type: disk
Persistent Reservation In cmd: 5e 02 00 00 00 00 00 20 00 00
persistent reservation in: Fixed format, current; Sense key: Illegal Request
Additional sense: Invalid command operation code
Info fld=0x0 [0]
PR in (Report capabilities): command not supported
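The check above can also be scripted. The following is a minimal sketch that assumes the output format of the two samples shown here; supports_wero is a hypothetical helper name, not something shipped with sg3_utils. It only looks for the reservation type that Pacemaker needs.

```shell
# Hypothetical helper: reads the output of
#   sg_persist --in --report-capabilities -v DEVICE
# on stdin and succeeds only if the type mask advertises
# "Write Exclusive, registrants only: 1".
supports_wero() {
    grep -Eiq 'Write Exclusive, registrants only:[[:space:]]*1[[:space:]]*$'
}

# Usage on a real host (needs root and a real device):
#   if sg_persist --in --report-capabilities -v /dev/sda 2>&1 | supports_wero
#   then echo "usable for SCSI fencing"
#   else echo "not usable for SCSI fencing"
#   fi
```

Because the helper reads from stdin, it can be tried against saved output before being pointed at a real LUN.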
Operating on SPC-3 PR devices
Gathering information
The sg_persist tool can also be used to list the current registrations:
sg_persist -n -i -k -d /dev/mapper/36001405973e201b3fdb4a999175b942f
an example of output is as follows:
PR generation=0x4a, 6 registered reservation keys follow:
0x9b0e0000
0x9b0e0000
0x9b0e0002
0x9b0e0002
0x9b0e0001
0x9b0e0001
One of these registered keys should hold the reservation: we can use the sg_persist tool to gather this information too:
sg_persist -n -i -r -d /dev/mapper/36001405973e201b3fdb4a999175b942f
an example of output is as follows:
PR generation=0x4a, Reservation follows:
Key=0x9b0e0000
scope: LU_SCOPE, type: Write Exclusive, registrants only
From the output of the previous commands we can infer that:
- there are 3 registered nodes: 0x9b0e0000, 0x9b0e0002 and 0x9b0e0001, on a multipath SAN (that's why each key is shown twice)
- node 0x9b0e0000 is the one holding the reservation
In such a scenario, node 0x9b0e0000 is the one that can remove the keys of failed nodes to fence them. A node entitled to do so is called the fencing node.
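The reasoning above (distinct keys are nodes, repetitions are paths) can be sketched as a small filter. This is only an illustration that assumes the key listing format shown above; keys_with_path_count is a hypothetical helper name.

```shell
# Hypothetical helper: reads the output of
#   sg_persist -n -i -k -d DEVICE
# on stdin and prints "KEY COUNT" for each distinct key, sorted by key.
# Lines holding anything other than a single hex key (such as the
# "PR generation=..." header) are ignored.
keys_with_path_count() {
    awk 'NF == 1 && $1 ~ /^0x[0-9a-fA-F]+$/ { n[$1]++ }
         END { for (k in n) print k, n[k] }' | sort
}

# Usage on a real host:
#   sg_persist -n -i -k -d /dev/mapper/36001405973e201b3fdb4a999175b942f \
#       | keys_with_path_count
```

With the sample listing above, each of the three keys is reported with a count of 2, one per path.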
Registering and Reserving a key
We can use the sg_persist tool to register a key:
sg_persist --out --register --param-sark=0xDEADBEEF /dev/mapper/36001405973e201b3fdb4a999175b942f
In the same way, we can take a reservation with that key:
sg_persist --out --reserve --param-rk=0xDEADBEEF --prout-type=5 /dev/mapper/36001405973e201b3fdb4a999175b942f
The --prout-type option specifies the type of reservation. Valid values are:
- 1 = write exclusive
- 3 = exclusive access
- 5 = write exclusive - registrants only
- 6 = exclusive access - registrants only
- 7 = write exclusive - all registrants
- 8 = exclusive access - all registrants
For more information issue:
man sg_persist
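Since the numeric codes are easy to mix up, a script can map readable names to them. The sketch below uses only the values listed above; the function name and the hyphenated type names are made up for this example and are not sg_persist options.

```shell
# Hypothetical helper: translate a readable reservation type name into the
# numeric value expected by --prout-type.
prout_type_code() {
    case "$1" in
        write-exclusive)                   echo 1 ;;
        exclusive-access)                  echo 3 ;;
        write-exclusive-registrants-only)  echo 5 ;;
        exclusive-access-registrants-only) echo 6 ;;
        write-exclusive-all-registrants)   echo 7 ;;
        exclusive-access-all-registrants)  echo 8 ;;
        *) echo "unknown reservation type: $1" >&2; return 1 ;;
    esac
}

# Usage on a real host:
#   sg_persist --out --reserve --param-rk=0xDEADBEEF \
#       --prout-type="$(prout_type_code write-exclusive-registrants-only)" \
#       /dev/mapper/36001405973e201b3fdb4a999175b942f
```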
Releasing and removing a key
You can remove registrations and release keys but you have to do it in the right order and in a way that is in line with the restrictions the keys impose. For example:
- you cannot release a key before removing the registrations associated with this key
- you cannot remove an exclusive reservation from a node other than the one that registered it
and so on.
In multipath environments, attempts to release the reservation through the path that didn't request the reservation fail, so you should retry the release using the next available multipath device until you reach the one that has been used to request the reservation.
The following command releases a reservation that was previously requested through the given device (for example /dev/sda):
sg_persist --out --release --param-rk= --prout-type=5 /dev/<DEVICE>
The following command unregisters a key that was previously registered from /dev/sda:
sg_persist --out --register --param-rk= /dev/sda
The following command clears the reservation that was previously requested from /dev/sda, along with all the registered keys:
sg_persist --out --clear --param-rk= /dev/sda
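The multipath retry described earlier in this section can be sketched as a loop. This is an illustration, not a tested procedure: it assumes that sg_persist exits with a non-zero status when the release is attempted through the wrong path, and the path names in the usage note are hypothetical. The SG_PERSIST variable is an addition of this sketch that allows a dry run with a stub command.

```shell
# Command to run; override SG_PERSIST to dry-run with a stub.
SG_PERSIST=${SG_PERSIST:-sg_persist}

# Hypothetical helper: try to release a "Write Exclusive, registrants only"
# reservation (type 5) through each given path in turn.
# Prints the first path on which the release succeeded, or fails if none did.
release_on_any_path() {
    key=$1
    shift
    for dev in "$@"; do
        if "$SG_PERSIST" --out --release --param-rk="$key" \
               --prout-type=5 "$dev" >/dev/null 2>&1; then
            echo "$dev"
            return 0
        fi
    done
    return 1
}

# Usage on a real host (key and path names are examples only):
#   release_on_any_path 0xDEADBEEF /dev/sdb /dev/sdc
```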
SPC-3 PR Fencing with Pacemaker
Now we have all the necessary skills to set up and operate SCSI fencing on Pacemaker by creating a stonith agent that uses fence_scsi.
Create a SCSI fencing device
Create the stonith agent that uses fence_scsi:
pcs stonith create scsi fence_scsi pcmk_host_list="www01 www02 www03" \
pcmk_monitor_action="metadata" pcmk_reboot_action="off" \
devices="/dev/mapper/3600140528f638e683a4426482cf1655c" \
meta provides="unfencing"
Verify the configuration:
pcs stonith show scsi
The output should look as follows:
Resource: scsi (class=stonith type=fence_scsi)
Attributes: pcmk_reboot_action=off devices=/dev/mapper/36001405973e201b3fdb4a999175b942f
Meta Attrs: provides=unfencing
Operations: monitor interval=60s (scsi-monitor-interval-60s)
pcs property set no-quorum-policy=freeze
Test fencing
We can now test the SCSI fencing we configured: first and foremost, list the registered keys:
sg_persist -n -i -k -d /dev/mapper/36001405973e201b3fdb4a999175b942f
The output should look as follows:
PR generation=0x37, 6 registered reservation keys follow:
0x9b0e0000
0x9b0e0000
0x9b0e0001
0x9b0e0001
0x9b0e0002
0x9b0e0002
Fence node www02: the expected outcome is that its key, 0x9b0e0001, gets dropped:
pcs stonith fence www02
List the registered keys again:
sg_persist -n -i -k -d /dev/mapper/36001405973e201b3fdb4a999175b942f
The output should look as follows:
PR generation=0x38, 4 registered reservation keys follow:
0x9b0e0000
0x9b0e0000
0x9b0e0002
0x9b0e0002
If, as in the previous output, the 0x9b0e0001 key is missing, then SCSI stonith is working and node www02 has been cut off from the LUN.
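The same before-and-after comparison can be scripted. The sketch below assumes the key listing format shown above; key_is_registered is a hypothetical helper name, and the key to look for has to be read from a listing taken before fencing.

```shell
# Hypothetical helper: succeeds if the key given as $1 appears in the
# output of `sg_persist -n -i -k -d DEVICE` supplied on stdin.
# The comparison ignores letter case; header lines are skipped because
# they contain more than one field.
key_is_registered() {
    awk -v key="$1" '
        NF == 1 && tolower($1) == tolower(key) { found = 1 }
        END { exit !found }'
}

# Usage on a real host, after `pcs stonith fence www02`:
#   if sg_persist -n -i -k -d /dev/mapper/36001405973e201b3fdb4a999175b942f \
#          | key_is_registered 0x9b0e0001
#   then echo "www02 is still registered: fencing did not work"
#   else echo "www02 key removed: fencing worked"
#   fi
```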
Troubleshooting
Verify the stonith devices defined in Pacemaker (note that this command requires the cluster to be running):
stonith_admin -L
The output should look as follows:
scsi
1 devices found
Verify that the stonith device can fence one of the nodes (note that this command requires the cluster to be running):
stonith_admin -l www02
The output should look as follows:
scsi
1 devices found
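For use in a monitoring script, the device count can be extracted from that output. This is a minimal sketch that assumes the "N devices found" wording shown above, which may differ between Pacemaker versions; stonith_device_count is a hypothetical helper name.

```shell
# Hypothetical helper: reads `stonith_admin -L` or `stonith_admin -l NODE`
# output on stdin and prints the number from the "N devices found" line.
# Prints nothing if no such line is present.
stonith_device_count() {
    grep -Eo '[0-9]+ devices? found' | awk '{ print $1; exit }'
}

# Usage on a real host:
#   count=$(stonith_admin -l www02 | stonith_device_count)
#   [ "${count:-0}" -ge 1 ] || echo "no stonith device can fence www02"
```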
Footnotes
Here ends our guided tour of SCSI Persistent Reservations and how to exploit them to implement SCSI fencing on Pacemaker: I hope you enjoyed it. Consider using it each time you set up a highly available cluster that makes use of shared storage, since it can dramatically limit the risk of data corruption when split brains occur.