Some businesses are so sensitive to service outages that they require a 24×7 Service Level Agreement. Guaranteeing 99.999% uptime means a cumulative service outage of no more than about 5 minutes per year. This impressive goal can be achieved by setting up High-Availability clusters.
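
As a quick back-of-the-envelope check, five nines leave an error budget of 0.001% of the year, which works out to roughly five minutes:

    # downtime budget allowed by a 99.999% uptime SLA over one year
    awk 'BEGIN { printf "%.2f minutes/year\n", 365.25 * 24 * 60 * (1 - 0.99999) }'
    # prints about 5.26 minutes/year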

Linux professionals should have a very good understanding of clustering, being able to choose the right cluster type to set up – Opt-In (asymmetric) or Opt-Out (symmetric) – and the clustering mode that suits each service (Active-Active or Active-Passive).
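
As a concrete illustration, with Pacemaker the Opt-In/Opt-Out choice boils down to the symmetric-cluster property; the commands below are only a minimal sketch, assuming the pcs tool and made-up node and resource names.

    # Opt-In (asymmetric) cluster: resources run nowhere unless explicitly allowed
    pcs property set symmetric-cluster=false
    pcs constraint location webserver prefers node1=100
    pcs constraint location webserver prefers node2=50

    # Opt-Out (symmetric, the default): resources run anywhere unless banned
    pcs property set symmetric-cluster=true
    pcs constraint location webserver avoids node3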

Of course this requires being skilled with several pieces of software, such as the following (a minimal example follows the list):

  • Corosync – cluster engine
  • Pacemaker – cluster resource manager
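
For instance, once Corosync and Pacemaker are up, an Active-Passive service is typically modeled around a floating IP address resource; the address below is a placeholder.

    # a highly available virtual IP, moved by Pacemaker to a healthy node
    pcs resource create virtual-ip ocf:heartbeat:IPaddr2 \
        ip=192.168.1.50 cidr_netmask=24 op monitor interval=30s

    # check where resources are currently running
    pcs status resources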

It is also often required to set up redundant filesystems; these are often, but not always, clustered – it deeply depends on the particular use case. Linux professionals should therefore also be skilled with the following (a minimal example follows the list):

  • DRBD – a kernel module to replicate block devices over the network
  • Setting the proper LVM locking type when not using clustered file systems
  • Clustered LVM – enabling the distributed lock manager on LVM
  • GFS2 – a clustered file system
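
As a minimal sketch of the LVM items above, recent LVM releases delegate cluster-wide locking to lvmlockd backed by the distributed lock manager (DLM), while older releases relied on the locking_type setting in /etc/lvm/lvm.conf; the commands below assume the dlm and lvm2-lockd packages and a made-up shared disk, so treat them as an outline rather than a full procedure.

    # stand-alone node with no clustered file system: keep the default local locking
    lvmconfig --type full global/use_lvmlockd    # expect use_lvmlockd=0

    # clustered LVM: bring up the DLM and lvmlockd, then create a shared volume group
    systemctl enable --now dlm lvmlockd
    vgcreate --shared vg_cluster /dev/sdb        # /dev/sdb is a placeholder shared disk

    # in a Pacemaker cluster, activation is then usually delegated to cloned
    # ocf:heartbeat:LVM-activate resources rather than performed by hand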

In addition to that, having a good understanding of Gluster – a resilient and replicated storage suite – may certainly help with some kinds of workloads.
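
To give an idea, a three-way replicated Gluster volume can be set up roughly as follows; hostnames and brick paths are made up for illustration.

    # from node1: form the trusted storage pool (hypothetical hostnames)
    gluster peer probe node2
    gluster peer probe node3

    # create and start a replica-3 volume, one brick per node
    gluster volume create gv0 replica 3 \
        node1:/data/brick1/gv0 node2:/data/brick1/gv0 node3:/data/brick1/gv0
    gluster volume start gv0

    # clients mount it with the native FUSE client
    mount -t glusterfs node1:/gv0 /mnt/gv0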

NetApp is certainly one of the most popular storage brands, with quite a big portfolio of cost-effective storage solutions. With the spread of Kubernetes, they developed Trident, their own Container Storage Interface (CSI) compatible storage orchestrator.

In the "NetApp Astra Trident Tutorial NFS Kubernetes CSI HowTo" we see how easy it is to deploy and set it up.

Kubernetes is certainly the most popular and probably the best solution for orchestrating containerized workloads, but maintaining its vanilla distribution is quite a challenge, so you must carefully weigh the pros and cons in terms of maintenance costs and operational risks.

A very cost-effective and interesting alternative to running vanilla Kubernetes is the "Rancher Kubernetes Engine 2" (RKE2), a certified Kubernetes distribution focused on security so as to adhere to the U.S. government’s compliance requirements. Besides providing a reliable Kubernetes distribution, RKE2 smoothly integrates with Rancher.
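
To give an idea of how lightweight the setup is, the first server node of an RKE2 cluster can be bootstrapped roughly as follows; the token path and the 9345 registration port are the currently documented defaults, so double-check them against the release you deploy.

    # 1) first server (control-plane) node
    curl -sfL https://get.rke2.io | sh -
    systemctl enable --now rke2-server.service

    # 2) grab the join token needed by the additional server nodes
    cat /var/lib/rancher/rke2/server/node-token

    # 3) on each additional server node, /etc/rancher/rke2/config.yaml contains:
    #      server: https://<first-server>:9345
    #      token: <node-token>
    #    then install and start rke2-server there as in step 1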

In the "RKE2 Tutorial - RKE2 Howto On Oracle Linux 9" post we see it in action, installing a highly available multi-master Kubernetes cluster, exposing the default ingress controller using MetalLB for providing Load Balancing services.


One of HAProxy's strengths is that it is not very strict about its configuration structure, which lets you craft configurations that fit very messy scenarios. Sadly, this is also its biggest maintainability pitfall: especially if you want to automate its configuration using automation tools and templates, it is up to you to define the standard configuration structure that best fits your needs.

The "HAProxy Tutorial - A Clean And Tidy Configuration Structure" post is an insight providing guidelines on how to structure the HAProxy configuration in an effective way, promoting the sharing of floating IP addresses and using easy to edit maps for load balancing the traffic forwarding it to the correct destination. In addition to that, it also provides a way for splitting the statistics so to have them displayed only for the scope of each specific balanced service instead of as a whole.

HAProxy is certainly one of the most renowned, fast and efficient (in terms of processor and memory usage) open source load balancers and proxies, enabling both TCP and HTTP-based applications to spread requests across multiple servers. In the "High Available HA Proxy Tutorial With Keepalived" post we see not only how to install it in a highly available fashion, but also how to keep its configuration clean and tidy, having it automatically fetched from a remote Git repository.
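
As a minimal sketch of the Keepalived side, a VRRP instance floats a virtual IP between the two HAProxy nodes; the interface name, router id and address are placeholders.

    # /etc/keepalived/keepalived.conf (placeholder NIC, router id and address)
    vrrp_instance VI_HAPROXY {
        state MASTER              # BACKUP on the second node
        interface eth0
        virtual_router_id 51
        priority 100              # use a lower priority on the BACKUP node
        advert_int 1
        virtual_ipaddress {
            192.168.1.100/24      # the floating service address
        }
    }

    systemctl enable --now keepalived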

Clustered file systems are powerful, but they must be carefully implemented to avoid split-brain situations, since these very likely lead to data corruption. A very effective way to cope with this risk is SCSI fencing: this technique denies access to the shared disks to nodes that the majority of the cluster considers failed. The only requirement for implementing SCSI fencing is that the shared storage must support SPC-3 Persistent Reservations. This post covers the topic and explains how to configure a STONITH device that exploits SCSI fencing.
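
As a rough sketch, with Pacemaker this boils down to a fence_scsi STONITH resource; the disk id and node names below are placeholders, and unfencing must be provided so that rebooted nodes can re-register their reservation keys.

    # SCSI fencing through SPC-3 persistent reservations (placeholder disk id)
    pcs stonith create scsi-fence fence_scsi \
        pcmk_host_list="node1 node2 node3" \
        devices="/dev/disk/by-id/wwn-0xEXAMPLE" \
        meta provides=unfencing

    # make sure fencing is enabled cluster-wide
    pcs property set stonith-enabled=true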
