Clustered file systems are powerful, but they must be carefully implemented to avoid split-brain scenarios, since these are very likely to lead to data corruption. A very effective way to cope with this risk is SCSI fencing: it denies access to the shared disks from nodes that the majority of the cluster considers failed. The only requirement for implementing SCSI fencing is that the shared storage supports SPC-3 Persistent Reservations. This post covers this topic and explains how to configure a stonith device that exploits SCSI fencing.

What is SCSI fencing

SCSI persistent reservations provide the capability to control each node's access to shared storage devices. Roughly put, using a SCSI reservation consists of registering a key with a device and then reserving the device with that key.

There are different kinds of reservations providing functionality such as write exclusivity: for example, from a specific machine you can register a key on a SCSI LUN and then obtain a write-exclusive reservation for that key. SCSI persistent reservations can be exploited to implement storage fencing in High Availability clusters: access to shared storage devices can be revoked from cluster members that are considered faulty. Pacemaker, for example, employs SCSI persistent reservations through the fence_scsi agent. Of course, the requirement is that the storage supports SCSI persistent reservations.

Registrations

A registration occurs when a node registers a unique key with a device: this means that a device has many registrations, at least one for each accessing node. In a multipath environment the same node registers with the device once per path, always using the same key. Keys are 64-bit values written as hexadecimal numbers: for example 0xDEADBEEF or 0x123ABC.

Reservations

A reservation dictates how a device can be accessed. Unlike registrations, there can be only one reservation on a device at any time. The node that holds the reservation is known as the "reservation holder", and the reservation defines how the other nodes may access the device.

Pacemaker uses a "Write Exclusive, Registrants Only" reservation: this type of reservation dictates that only nodes that have registered with that device may write to the device.

Fencing

Pacemaker is able to perform fencing of a failed node using SCSI Persistent Reservations (PR) by simply removing a node's registration key from all devices the node is connected to. This prevents the failed node from being able to write to those devices again.

Validating SPC-3 PR support

As you can easily guess, the very first thing to do is to ensure that the connected SCSI storage actually supports SPC-3 Persistent Reservations. The sg3_utils RPM package provides, besides other tools, sg_persist.

Install it as follows:

yum install -y sg3_utils

sg_persist can be used to perform the check: for example, to test /dev/sda, issue:

sg_persist --in --report-capabilities -v /dev/sda

If the output is as follows, then SPC-3 Persistent Reservations are supported:

      inquiry cdb: 12 00 00 00 24 00 
    LIO-ORG   www_0_static      4.0
    Peripheral device type: disk
      Persistent Reservation In cmd: 5e 02 00 00 00 00 00 20 00 00 
  Report capabilities response:
    Compatible Reservation Handling(CRH): 1
    Specify Initiator Ports Capable(SIP_C): 1
    All Target Ports Capable(ATP_C): 1
    Persist Through Power Loss Capable(PTPL_C): 1
    Type Mask Valid(TMV): 1
    Allow Commands: 1
    Persist Through Power Loss Active(PTPL_A): 0
      Support indicated in Type mask:
        Write Exclusive, all registrants: 1
        Exclusive Access, registrants only: 1
        Write Exclusive, registrants only: 1
        Exclusive Access: 1
        Write Exclusive: 1
        Exclusive Access, all registrants: 1

Otherwise, if it is like the following, SPC-3 Persistent Reservations are not supported:

      inquiry cdb: 12 00 00 00 24 00
    VBOX      HARDDISK          1.0
      Peripheral device type: disk
        Persistent Reservation In cmd: 5e 02 00 00 00 00 00 20 00 00
        persistent reservation in:  Fixed format, current;  Sense key: Illegal Request
        Additional sense: Invalid command operation code
        Info fld=0x0 [0]
        PR in (Report capabilities): command not supported
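In a script, the two cases above can be told apart by looking for the "Report capabilities response:" header in the output. A minimal sketch follows: the supported sample output is embedded in a heredoc here, but in practice you would pipe in the output of `sg_persist --in --report-capabilities -v <device>` instead.

```shell
# Sketch: decide whether a device supports SPC-3 PR by checking for the
# "Report capabilities response:" header. The heredoc below stands in for
# real sg_persist output.
output=$(cat <<'EOF'
  Report capabilities response:
    Compatible Reservation Handling(CRH): 1
    Persist Through Power Loss Capable(PTPL_C): 1
EOF
)
if printf '%s\n' "$output" | grep -q 'Report capabilities response:'; then
  supported=yes
else
  supported=no
fi
echo "SPC-3 PR supported: $supported"
```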

Operating on SPC-3 PR devices

Gathering information

The sg_persist tool can also be used to get information on the current registrations:

sg_persist -n -i -k -d /dev/mapper/36001405973e201b3fdb4a999175b942f

an example of output is as follows:

  PR generation=0x4a, 6 registered reservation keys follow:
    0x9b0e0000
    0x9b0e0000
    0x9b0e0002
    0x9b0e0002
    0x9b0e0001
    0x9b0e0001

Among these registrations there should be a reservation: we can use the sg_persist tool to gather this information too:

sg_persist -n -i -r -d /dev/mapper/36001405973e201b3fdb4a999175b942f

an example of output is as follows:

   PR generation=0x4a, Reservation follows:
    Key=0x9b0e0000
    scope: LU_SCOPE,  type: Write Exclusive, registrants only

From the output of the previous commands we can infer that:

  • there are 3 registered nodes (0x9b0e0000, 0x9b0e0002 and 0x9b0e0001) on a multipath SAN, which is why each key is shown twice
  • node 0x9b0e0000 is the one holding the reservation

In such a scenario, node 0x9b0e0000 is the one that can remove the keys of failed nodes to fence them. A node entitled to do so is called the fencing node.
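The per-path duplicates can be collapsed with a quick shell filter to list the distinct node keys. A sketch: the sample output from above is embedded in a heredoc, but in practice you would pipe in `sg_persist -n -i -k -d <device>` instead.

```shell
# Sketch: list the distinct registration keys, collapsing the one-per-path
# duplicates seen on multipath SANs. The heredoc stands in for real output.
keys=$(cat <<'EOF'
  PR generation=0x4a, 6 registered reservation keys follow:
    0x9b0e0000
    0x9b0e0000
    0x9b0e0002
    0x9b0e0002
    0x9b0e0001
    0x9b0e0001
EOF
)
# Keep only the key lines (leading whitespace then "0x..."), then deduplicate.
unique_keys=$(printf '%s\n' "$keys" | awk '/^[[:space:]]*0x/ {print $1}' | sort -u)
printf '%s\n' "$unique_keys"
```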

As you can see, only the key is shown: in a multipath environment this can generate confusion when it comes to releasing a registration, since the registration can be released only by the requesting node through the same path that acquired it. This means that, when doing this manually, you should try to release the key on each of the multipath devices until you find the one that was used to request the reservation.

Registering and Reserving a key

We can use the sg_persist tool to register a key:

sg_persist --out --register --param-sark=0xDEADBEEF /dev/mapper/36001405973e201b3fdb4a999175b942f

In the same way, we can acquire a reservation with that key:

sg_persist --out --reserve --param-rk=0xDEADBEEF --prout-type=5 /dev/mapper/36001405973e201b3fdb4a999175b942f

The --prout-type option specifies the type of reservation. Valid values are the following:

  • 1 = write exclusive
  • 3 = exclusive access
  • 5 = write exclusive - registrants only
  • 6 = exclusive access - registrants only
  • 7 = write exclusive - all registrants
  • 8 = exclusive access - all registrants

For more information issue:

man sg_persist

Releasing and removing a key

Removing SCSI reservations without understanding how the application is using them can be problematic and may lead to data corruption or other unexpected behavior. Don't play with these commands on a production LUN.

You can remove registrations and release keys but you have to do it in the right order and in a way that is in line with the restrictions the keys impose. For example:

  • you cannot release a key before removing the registrations associated with it
  • you cannot remove an exclusive reservation from a node other than the one that registered it

and so on.

In multipath environments, attempts to release the reservation through a path that didn't request it fail, so you should retry the release on the next available multipath device until you reach the one that was used to request the reservation.

The following command releases a reservation that was previously acquired through a given device:

sg_persist --out --release --param-rk=<KEY> --prout-type=5 /dev/<DEVICE>

the following command unregisters a key that was previously registered from /dev/sda:

sg_persist --out --register --param-rk=<KEY> /dev/sda

the following command clears the reservation that was previously acquired from /dev/sda, along with all the registered keys:

sg_persist --out --clear --param-rk=<KEY> /dev/sda
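The multipath retry advice above can be put into practice with a small helper that attempts the release through each underlying path until one succeeds. This is a sketch: the function name and example paths are assumptions; list the real paths of a multipath map with `multipath -ll`.

```shell
# Hypothetical helper: try to release a reservation through each underlying
# path of a multipath device until one attempt succeeds.
release_on_any_path() {
  key=$1; shift
  for path in "$@"; do
    # Only the path that acquired the reservation will accept the release.
    if sg_persist --out --release --param-rk="$key" --prout-type=5 "$path" >/dev/null 2>&1; then
      echo "reservation released via $path"
      return 0
    fi
  done
  echo "release failed on all paths" >&2
  return 1
}
```

Example usage: `release_on_any_path 0x9b0e0000 /dev/sdb /dev/sdc`.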

SPC-3 PR Fencing with Pacemaker

Now we have all the necessary skills to set up and operate SCSI fencing on Pacemaker by creating a stonith device that uses fence_scsi.

Create a SCSI fencing device

Create the stonith agent that uses fence_scsi:

pcs stonith create scsi fence_scsi pcmk_host_list="www01 www02 www03" \
pcmk_monitor_action="metadata" pcmk_reboot_action="off" \
devices="/dev/mapper/3600140528f638e683a4426482cf1655c" \
meta provides="unfencing"

verify the configuration:

pcs stonith show scsi

the output should look as follows:

 Resource: scsi (class=stonith type=fence_scsi)
  Attributes: pcmk_reboot_action=off devices=/dev/mapper/36001405973e201b3fdb4a999175b942f 
  Meta Attrs: provides=unfencing 
  Operations: monitor interval=60s (scsi-monitor-interval-60s)

In such a scenario, it is also best to configure Pacemaker to freeze when quorum is lost:

pcs property set no-quorum-policy=freeze

Test fencing

We can now test the SCSI fencing we configured: first and foremost, list the registered keys

sg_persist -n -i -k -d /dev/mapper/36001405973e201b3fdb4a999175b942f

output should look as follows:

 PR generation=0x37, 6 registered reservation keys follow:
    0x9b0e0000
    0x9b0e0000
    0x9b0e0001
    0x9b0e0001
    0x9b0e0002
    0x9b0e0002

fence node www02: the outcome is that the node's key 0x9b0e0001 gets dropped

pcs stonith fence www02

list the registered keys again

sg_persist -n -i -k -d /dev/mapper/36001405973e201b3fdb4a999175b942f

output should look as follows:

  PR generation=0x38, 4 registered reservation keys follow:
    0x9b0e0000
    0x9b0e0000
    0x9b0e0002
    0x9b0e0002

If, as in the previous output, the 0x9b0e0001 key is missing, then the SCSI stonith is working and node www02 has been disconnected from the LUN.
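This last check can also be scripted. A sketch: the post-fencing sample output from above is embedded in a heredoc, but in practice you would pipe in `sg_persist -n -i -k -d <device>` instead.

```shell
# Sketch: check whether a fenced node's key is gone from the registrations.
# The heredoc stands in for real post-fencing sg_persist output.
fenced_key=0x9b0e0001
post=$(cat <<'EOF'
  PR generation=0x38, 4 registered reservation keys follow:
    0x9b0e0000
    0x9b0e0000
    0x9b0e0002
    0x9b0e0002
EOF
)
if printf '%s\n' "$post" | grep -qw "$fenced_key"; then
  result="still registered"
else
  result="fenced"
fi
echo "$fenced_key: $result"
```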

Troubleshooting

Verify the stonith devices defined in Pacemaker - note that this command requires that the cluster has been started

stonith_admin -L

the output should look as follows:

 scsi
1 devices found

Verify that the stonith device can reach one of the nodes - note that this command requires that the cluster has been started

stonith_admin -l www02

the output should look as follows:

 scsi
1 devices found

Footnotes

Here ends our guided tour of SCSI Persistent Reservations and how to exploit them to implement SCSI fencing with Pacemaker: I hope you enjoyed it. Consider using it each time you set up a highly available cluster that makes use of shared storage, since it can dramatically limit data corruption risks when split brains occur.

Writing a post like this takes a lot of hours. I'm doing it for the sole pleasure of sharing knowledge and thoughts, but all of this does not come for free: it is a time-consuming volunteering task. This blog is not affiliated with anybody, does not show advertisements nor sells data of visitors. The only goal of this blog is to make ideas flow. So please, if you liked this post, spend a little of your time to share it on Linkedin or Twitter using the buttons below: seeing that posts are actually read is the only way I have to understand whether I'm really sharing thoughts or just wasting time and I'd better give up.
