It is almost impossible not to have heard about or used LVM: it has been one of the pillars of every Linux distribution for decades. Almost everyone using Linux has used it to create or modify the basic storage structures of their system. The trouble is that very often people focus on the specific task at hand and never take the time to investigate its amazing features. The goal of LVM Tutorial - A thorough howto on the Logical Volume Manager is to provide an easy yet comprehensive explanation of the most interesting features of LVM that you are very likely to need sooner or later.

What is the Logical Volume Manager

LVM is a Logical Volume Manager for the Linux operating system developed many years ago by Sistina Software.

A Logical Volume Manager provides a layer of abstraction over the disk storage that is more flexible than the traditional view of disks and partitions: from the storage consumer perspective, volumes under the control of the Logical Volume Manager can be extended but also shrunk in size.

In addition to that, data blocks can be moved across the storage devices at will: from the hardware perspective, this makes it possible to migrate all the data from one storage device onto the others, for example to replace it with a new one with no service outage. Besides these basic features, LVM also supports thin-provisioning, snapshots and the use of golden images, and it can be used to support highly available clusters.

Clustered LVM is a huge topic: describing its usage, besides the clustering of LVM itself, requires explaining clustered file systems such as GFS2 and resource clustering software such as cluster management systems. It would take too much space to cover this within this post, ... but maybe I will write a post dedicated to this topic sooner or later.

Physical Volumes and Volume Groups

The Volume Group (VG) is the highest abstraction level within LVM: it is made of all the available storage coming from a set of storage devices (either whole disks or partitions of disks) called Physical Volumes (PV).

IMHO there are no actual benefits in partitioning a disk used as a Physical Volume: the only good reason for doing so is when defining the partition scheme of the system disk, creating the /boot mount point on a dedicated partition and a dedicated partition for the Physical Volume of the system Volume Group.
Having disks with a single partition dedicated to the PV is only cumbersome to maintain: for example, when it comes to increasing the size of the PV, you need to extend the partition along with the disk device itself, since the PV is the partition itself. But extending a partition very often requires deleting and recreating it, a task that requires a lot of attention to avoid errors. In this post, however, I show both methods, since you may need to operate on partition based PVs, although I strongly advise against working this way.

Logical Volumes

The VG can then be split into partitions called Logical Volumes (LV). Of course a system may have more than one VG, with multiple LVs each. What is not possible is sharing the same PV between VGs: a PV belongs to one and only one VG.
You can think of LVs as partitions of a huge logical storage (the VG) that spans multiple devices (the PVs), which can be either whole disks or partitions of disks. This of course implies that an LV can span more than one disk, and that it can have a size up to the sum of the sizes of all the PVs.

Technical Details

At the time of writing this post, the upper bound for the LV size with LVM2 format volumes (assuming a 64-bit architecture and a 2.6 or later kernel) is 8 exabytes.

Physical Extent (PE)

The Physical Extent (PE) is the smallest unit of allocatable storage when dealing with LVM: you must carefully choose its size depending on the workload of the Volume Group when creating the VG; if the PE size parameter is omitted, it defaults to 4MiB. As you can easily guess, you cannot change the PE size once the VG has been created. PEs are allocated or released whenever an LV is created, extended or even shrunk. As soon as a PV is added to a VG, it gets split into a number of chunks of data (the PEs) equal to the size of the PV divided by the PE size. These PEs are then assigned to the pool of free extents.
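
For example, you can check the PE size of an already existing VG (assuming a VG named "data", as used later in this post) as follows:

vgdisplay data | grep "PE Size"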

Logical Extent (LE) 

This is the minimum block you can assign to a Logical Volume (LV): Logical Extents (LE) are actually PEs taken from the pool of free PEs and assigned to a Logical Volume. So you can think of PE and LE as two different ways to refer to the same thing, depending on the logical (LE) or physical (PE) perspective.

Logical Extent (LE) Allocation And Release

LEs are assigned to the LV when it is created or extended, and returned to the pool of free extents of the VG when the LV is shrunk or removed.

Physical Extent (PE) Allocation

The PV used to physically allocate the underlying PEs of an LV is determined by the so-called allocation policy: for example, the linear mapping policy tries to assign PEs of the same PV that are next to one another; striped mapping instead interleaves PEs across all the available PVs.
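
If you are curious about which mapping an existing LV ended up with, a quick way to check is asking lvs for the segment type and the backing devices (a minimal example, assuming a VG named "data" with some LVs already defined):

lvs -o lv_name,segtype,devices data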

Mind that almost all of the following commands require administrative privileges, so you must run them as the root user, or prepend "sudo" to each of them.

Discovering existing LVM

The available LVM configuration is automatically detected at boot time: the system scans for PVs and, if at least one is found, it looks for LVM metadata to work out the VG the PV belongs to and the LVs defined inside it.

Despite this being done automatically, it is better to know how to manually launch these scans: you will need them when troubleshooting boot problems.

The scan for LVM metadata can be accomplished by typing the following three commands in this exact sequence:

pvscan
vgscan
lvscan

After running these commands, all the available LVM devices get discovered.
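
When troubleshooting from a rescue or emergency shell, after these scans you typically also need to activate the discovered Volume Groups before their Logical Volumes can be mounted; for example, assuming a VG named "data":

vgchange -ay data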

When running pvscan, you should get something like the following message:

  PV /dev/sdb lvm2 [10.00 GiB]
  PV /dev/sdc lvm2 [10.00 GiB]
  Total: 2 [20.00 GiB] / in use: 0 [0 ] / in no VG: 2 [20.00 GiB]

In this example, pvscan found two devices ("/dev/sdb" and "/dev/sdc") of 10GiB each.

If instead you get the following message:

No matching physical volumes found

it means that your system is not using LVM at all. A typical scenario where this often happens is when using prepackaged Vagrant boxes, since most of the time they only use basic partitions.

Listing existing LVM

Listing Physical Volumes

Type the following command:

pvdisplay
If you get no output, it means you have no PV on your system yet: just skip to the "Working with PV" chapter, then come back here to see how to gather information on the available PVs on your system.

on my system, the output is as follows:

  "/dev/sdb" is a new physical volume of "10.00 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/sdb
  VG Name               
  PV Size               10.00 GiB
  Allocatable           NO
  PE Size               0   
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               H7VllH-l0Lk-cT0B-PCg5-X4w9-LkUp-CeFsdc
   
  "/dev/sdc" is a new physical volume of "10.00 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/sdc
  VG Name               
  PV Size               10.00 GiB
  Allocatable           NO
  PE Size               0   
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               0hGOEg-BjSJ-0xPw-UWlc-9O8j-mdob-tDpZgg

It shows that both "/dev/sdb" and "/dev/sdc" are brand new PVs, indeed:

  • "VG Name" is empty
  • every PE-related counter is 0.

you may prefer to get only summary information:

pvs

on my system, the output is as follows:

  PV         VG Fmt  Attr PSize  PFree 
  /dev/sdb      lvm2 ---  10.00g 10.00g
  /dev/sdc      lvm2 ---  10.00g 10.00g

When it comes to scripting, the most convenient command to use is pvs, since it provides handy command line options to output only specific fields (and there are lots of fields available). For example, you can see alignment information by specifying "-o +pe_start" as follows:

pvs -o +pe_start --units k

Listing Volume Groups

Type the following command:

vgdisplay
If you get no output, skip to the "Working with VG" chapter, then come back here to see how to gather information on the available VGs on your system.

on my system, the output is as follows:

  --- Volume group ---
  VG Name               data
  System ID             
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               19.99 GiB
  PE Size               4.00 MiB
  Total PE              5118
  Alloc PE / Size       0 / 0   
  Free  PE / Size       5118 / 19.99 GiB
  VG UUID               QndPIg-Sn6h-lNgL-wNDt-1OPx-4X6c-5OORrV

It shows that the "data" VG has been created with a PE size of 4MiB and has never been used, since the Allocated PE count is 0 out of 5118 available.

if you want to get summary information just type:

vgs

the output on my system is:

  VG   #PV #LV #SN Attr   VSize  VFree 
  data   2   0   0 wz--n- 19.99g 19.99g

Since you can have more than one VG on your system, you can of course restrict both commands to a specific VG as follows:

vgdisplay data
vgs data

Listing Logical Volumes

Type the following command:

lvdisplay
If you get no output, skip to the "Working with LV" chapter, then come back here to see how to gather information on the available LVs on your system.

on my system, the output is as follows:

  --- Logical volume ---
  LV Path                /dev/data/pgsql_data
  LV Name                pgsql_data
  VG Name                data
  LV UUID                3OPj5y-Uio5-pxQt-cY9F-zjfK-QymG-iCdiM7
  LV Write Access        read/write
  LV Creation host, time localhost.localdomain, 2022-11-08 22:30:54 +0100
  LV Status              available
  # open                 0
  LV Size                16.00 GiB
  Current LE             4096
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:0

That is, we have only one LV called "pgsql_data" that belongs to the "data" VG.
To get summary information, just type:

lvs

the output on my system is as follows:

  LV         VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  pgsql_data data -wi-a----- <16.00g

Since you can have more than one LV on your system, you can of course restrict both commands to a specific LV as follows:

lvdisplay data/pgsql_data
lvs data/pgsql_data

When it comes to scripting, the most convenient command to use is lvs, since it provides handy command line options to output only specific fields (and there are lots of fields available). For example, you can add the devices field to the output of lvs by adding "-o +devices" as follows:

lvs -o +devices

on my system, the output is as follows:

  LV         VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices    
  pgsql_data data -wi-a----- 16.00g                                                     /dev/sdb(0)
  pgsql_data data -wi-a----- 16.00g                                                     /dev/sdc(0)

Working with PV

Creating a PV

The easiest, most straightforward and most scalable way of creating a Physical Volume is to simply use a whole device as a PV, without partitioning it first.

This of course requires an empty disk, so if your VM does not have a free one yet, attach it and rescan the SCSI bus as follows:

for host in $(ls -d /sys/class/scsi_host/host*); do echo "- - -" > $host/scan; done

if instead your hypervisor does not support hot-plugging disks, shut down the machine, attach a new virtual disk and boot the virtual machine again.

Once the new disk has been added, we can list the available block devices by typing:

lsblk

the output on my system is as follows:

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0 78.1G  0 disk 
├─sda1   8:1    0  200M  0 part /boot/efi
├─sda2   8:2    0    1G  0 part /boot
├─sda3   8:3    0   16G  0 part [SWAP]
└─sda4   8:4    0 60.9G  0 part /
sdb      8:16   0   10G  0 disk
sdc      8:32   0   10G  0 disk
sr0     11:0    1 1024M  0 rom

As you can see, "sdb" and "sdc" are the only unpartitioned disk devices, so the new disks have been added as "sdb" and "sdc".

To configure "/dev/sdb" and "/dev/sdc" as Physical Volumes just type:

pvcreate /dev/sdb /dev/sdc

on my system the output is as follows:

  Physical volume "/dev/sdb" successfully created.
  Physical volume "/dev/sdc" successfully created.
  Creating devices file /etc/lvm/devices/system.devices

As previously mentioned, there is a school of thought (which I do not share, by the way) stating that even if the storage device is fully dedicated to the Physical Volume, you should anyway create a partition on the storage device and use it as the PV.

So for example, following this theory, if "/dev/sdb" were the empty disk you want to use as a PV, you would first have to partition it, creating the "/dev/sdb1" partition:

parted /dev/sdb --script mklabel gpt
parted -a optimal /dev/sdb --script mkpart volume 1M 100%
parted -a optimal /dev/sdb --script set 1 lvm on

and only then create the PV label on the "/dev/sdb1" partition you have just created:

pvcreate /dev/sdb1

As previously said, IMHO there are more cons than pros in using this approach.

Dealing With Alignment On Physical Disks

As long as you are working with virtual disks or LUNs from a SAN you don't have to worry about alignment, but when it comes to using physical devices you must take care of it.

When dealing with simple disks, most of the time you can blindly align at 1MiB.

The alignment of the PV to create is specified using the "--dataalignment" command option: for example, to create a PV aligned at 1MiB, just type:

pvcreate --dataalignment 1MiB /dev/sdb

Dealing With Alignment On RAID Devices

If you are on top of a RAID array, things become a little more complex, since you must find out the chunk size that was set when configuring the array. The general rule to calculate the PV data alignment is to multiply the chunk size by the number of data disks in the RAID array, that is without counting the parity disks (RAID 5 has one parity disk, RAID 6 has two).

Of course there are exceptions, for example:

  • RAID1 does not have parity: with RAID 1 the number of stripes is equal to the number of disks
  • with RAID 1+0 the number of stripes is equal to half of the disks

and so on.

How to find out the chunk size deeply depends on the RAID implementation.

For example, when dealing with hardware RAID, if you have an LSI MegaRAID controller you can find out the chunk size using the storcli64 command line utility as follows:

./storcli64 /c0/v1 show all | grep Strip

the output would be similar to the following one:

     Stripe Size : 128K

Conversely, when dealing with software RAID, you can just rely on the mdadm command line utility inspecting any disk that is part of the array as follows:

mdadm -E /dev/sdb | grep "Chunk Size"

the output would be similar to the following one:

     Chunk Size : 128K

As an example, if we have a hypothetical "/dev/md127" configured as RAID 5 with 4 disks and a chunk size of 128K, the PV data alignment would be 128K * (4-1) = 128K * 3 = 384K.

So the command to create a properly aligned PV would be:

pvcreate -M2 --dataalignment 384K /dev/md127

Resizing a PV

In this post I show in practice only how to increase the size of a Physical Volume - it is very unlikely that someone needs to shrink one (although it did happen to me once). Anyway, for your information, mind that although shrinking a Physical Volume is technically possible, you must be very careful: used PEs may be scattered across the PV, so you must compact them first (see "Manually relocating PEs across the PVs" for the details) and only after this can you safely shrink the PV. An easy way may be to add a new PV, move the PEs of the PV you want to shrink to the new one, shrink the PV, move the PEs back to the shrunken PV and remove the new PV. We will see how to move PEs between PVs later on.
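
Just to give an idea of the overall flow on the LVM side, here is a minimal sketch of a PV shrink (device name and target size are placeholders, and it assumes the other PVs of the VG have enough free PEs to host the relocated data):

pvmove /dev/sdb                                 # relocate all allocated PEs away from the PV to shrink
pvresize --setphysicalvolumesize 8G /dev/sdb    # then shrink the PV to the new, smaller size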

The very first thing to do is of course to increase the size of the disk - the way of doing this deeply depends on the virtualization software / RAID implementation or the SAN you are using.

As a rule of thumb, when dealing with virtual disks - as opposed to LUNs - especially if you are using features such as VMware Storage vMotion, you should avoid increasing the size of a disk beyond 100GB: big disks are more complicated to migrate, since they take more time and more available space on the nodes of the storage cluster.

Once the storage device has been increased, rescan it so that the kernel gets informed that the size is changed. For example, if the resized device is "/dev/sdb", type:

echo 1 > /sys/block/sdb/device/rescan

Things then differ a little depending on whether the PV is a whole storage device or a partition of a storage device.

PV Created On A Whole Disk

Let's see the partitions defined on sdb:

lsblk /dev/sdb

the output must be similar to the following one:

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sdb      8:16   0   10G  0 disk 

as you can see there are no partitions defined on "/dev/sdb": we can directly resize the Physical Volume as follows:

pvresize /dev/sdb

the output on my system is as follows:

  Physical volume "/dev/sdb" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized 

PV Created On A Partition

Using "/dev/sdb" as example, if the output of the command:

lsblk /dev/sdb

is like the following:

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sdb      8:16   0   10G  0 disk 
└─sdb1   8:17   0    8G  0 part 

then the PV is "/dev/sdb1", which means that the PV has been created inside a partition.

In such a scenario, in order to resize the PV we have to resize the partition as well - since parted does not support the resize command anymore, we have to remove and recreate the partition:

parted -a optimal /dev/sdb --script rm 1
parted -a optimal /dev/sdb --script mkpart volume 1M 100%
parted -a optimal /dev/sdb --script set 1 lvm on

since we specified 100%, the new partition takes up the whole disk. Now that we have increased the partition, we must notify LVM that the PV size has changed, as follows:

pvresize /dev/sdb1

the output must be as follows:

  Physical volume "/dev/sdb1" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized 

Deleting a PV

Sometimes you may need to delete a PV. For example, you may want to replace a low performance disk holding a PV with a high performance one, and reuse the old disk for something else.
This requires you to set up the high performance disk as a new PV, remove the PV on the low performance disk from the VG and eventually delete the PV label from the low performance disk.

You cannot simply delete a PV assigned to a VG: before deleting the PV you must remove it from the VG it belongs to. This operation is called "reducing a VG", since you are actually reducing the VG in size - for more information read the relevant paragraph beneath "Working with VG".

Deleting a PV is simple: for example, if the PV is "/dev/sdb" disk:

pvremove /dev/sdb
lvmdevices --deldev /dev/sdb

if instead we are dealing with a partition as PV, such as "/dev/sdb1":

pvremove /dev/sdb1
lvmdevices --deldev /dev/sdb1

pvremove wipes the LVM label from the device, whereas lvmdevices --deldev removes the related entry from "/etc/lvm/devices/system.devices".

If you forget to run the lvmdevices statement, when typing any LVM command you will get messages like the following one:

Devices file sys_wwid t10.ATA_____rocky_default_1667937593571_37531-0_SSD_9DZXE76HHYJ412N6YMGD PVID none last seen on /dev/sdb1 not found.

Replacing a PV

This is accomplished by removing the PV you want to replace from the VG - see "Reducing a VG" for the details. Anyway, mind that the VG must have enough free space to hold the PEs that will be migrated from the PV you are removing to the rest of the PVs. If there is not enough free space on the remaining PVs, you must extend the VG first - see "Extending a VG" for the details.
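
The overall flow looks like this - a hedged sketch where "/dev/sdb" is the old PV, "/dev/sdd" the new disk and "data" the VG (all placeholder names):

pvcreate /dev/sdd           # label the new disk as a PV
vgextend data /dev/sdd      # add the new PV to the VG
pvmove /dev/sdb /dev/sdd    # migrate the allocated PEs from the old PV to the new one
vgreduce data /dev/sdb      # remove the old PV from the VG
pvremove /dev/sdb           # finally wipe the LVM label from the old disk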

Working with VG

Creating a VG

To create a VG, you must use the vgcreate command followed by the name you want to give to the VG along with the list of PV you want to assign to it.

For example, to create a VG called "data" that spans the "/dev/sdb" and "/dev/sdc" PVs, type:

vgcreate data /dev/sdb /dev/sdc

Of course, both "/dev/sdb" and "/dev/sdc" must already have been configured as PVs.

It is of course not necessary to have more than one PV: you can safely create VGs with just one PV and extend them as needed by adding other PVs later on. Anyway, mind that having more than one PV can boost performance, since it lets you create striped Logical Volumes. More on this topic later on.

When the underlying storage device of the PV is a RAID that performs striping, it is mandatory to set the PE size accordingly: getting back to the previous example with a stripe size of 384K, we must set the PE size to a multiple of it, for example 3840K. This can be achieved by adding the -s command option as follows:

vgcreate -s 3840K data /dev/md127
LVM documentation states that the PE size must be equal to the stripe size or to a multiple of it.

Renaming a VG

You may of course need to rename an existing VG: this can be accomplished using the vgrename command. For example, to rename the "foovg" VG to "data" just type:

vgrename foovg data

Extending a VG

It is very likely that you will run out of space (free PEs) on a VG sooner or later.

Extending a VG can be accomplished in either of the following two ways:

  • extending the PVs that are already members of the VG (for more information, see Resizing a PV)
  • adding new PVs (see below).

Extending a VG by adding new disks

Extending a VG this way is really simple. You just need a new disk to be used as PV, so create a virtual disk or LUN and then rescan the SCSI bus on every SCSI host as follows:

for host in $(ls -d /sys/class/scsi_host/host*); do echo "- - -" > $host/scan; done

then make sure the new disk (or disks) have become available to the system:

lsblk

if everything properly worked out, the new disk (or disks) must be listed in the output:

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0 78.1G  0 disk 
├─sda1   8:1    0  200M  0 part /boot/efi
├─sda2   8:2    0    1G  0 part /boot
├─sda3   8:3    0   16G  0 part [SWAP]
└─sda4   8:4    0 60.9G  0 part /
sdb      8:16   0   10G  0 disk 
sdc      8:32   0   10G  0 disk
sdd      8:48   0   10G  0 disk
sde      8:64   0   10G  0 disk
sr0     11:0    1 1024M  0 rom  

In this example I added two new disks, which are seen as "/dev/sdd" and "/dev/sde" - we can add them to the "data" VG as follows:

vgextend data /dev/sdd /dev/sde

If instead we were dealing with partitions as the new PVs, such as "/dev/sdd1" and "/dev/sde1", the command would be:

vgextend data /dev/sdd1 /dev/sde1

Reducing a VG

Although this is not very likely to happen, you may need to reduce the size of a VG: it may happen that you want to remove a PV from a VG, either temporarily or permanently. For example:

  • you may need to replace an old disk holding a PV that is quite likely to break
  • you may need to replace a low performance disk holding the PV with a high performance one
  • ...

Reducing a VG can be accomplished in either of the following two ways:

  • removing PVs
  • shrinking the PVs that are already members of the VG (see the note below)

When shrinking a VG with a single PV, you must of course shrink the PV. As previously explained, although this can technically be done, it demands a lot of attention, since you must first compact the allocated PEs (see "Manually relocating PEs across the PVs" for the details) and only then can you reduce the size of the PV. Even worse, when the PV is a partition, you must drop and recreate the partition with the new, smaller size.

In order for a PV to be removed from a VG, it must not contain any allocated PEs: this means that you must migrate all the allocated PEs of the PV you want to remove to the remaining PVs of the VG; if the available free PEs on the remaining PVs are not enough, you must add an additional PV or extend the existing ones before proceeding.

For example, if we want to remove "/dev/sdc" PV from the "data" VG, we can get its PE consumption as follows:

pvdisplay /dev/sdc | grep PE

the output on my system is as follows:

  PE Size               32.00 MiB
  Total PE              156
  Free PE               88
  Allocated PE          68

Now let's see if the "data" VG has enough free PEs to store the PEs consumed by "/dev/sdc":

vgs -o vg_free_count data

the output on my system is:

  Free
  3514

Since to remove "/dev/sdc" we need at least 68 free PEs in the "data" VG, and the VG has 3514 free PEs, we can safely remove the "/dev/sdc" PV from the VG.

Let's empty "/dev/sdc" as follows:

pvmove /dev/sdc

the output on my system is:

  /dev/sdc: Moved: 0.00%
  /dev/sdc: Moved: 10.79%
…
  /dev/sdc: Moved: 91.40%
  /dev/sdc: Moved: 100.00%

now that "/dev/sdc" has no allocated PEs, we can safely reduce the VG by removing the "/dev/sdc" PV from it:

vgreduce data /dev/sdc

it is now safe to remove LVM labels from "/dev/sdc" using pvremove as previously shown:

pvremove /dev/sdc
lvmdevices --deldev /dev/sdc

Manually relocating PEs across the PVs

The pvmove command is much more powerful than what we have seen so far: it is so granular that it even lets you move ranges of PEs from PV to PV, or even within the same PV.
As an example, let's have a look at the current allocation of the PEs on "/dev/sdb":

pvs -o +pvseg_all,lv_name /dev/sdb | awk 'NR == 1; NR > 1 {print $0 | "sort -k 9 -k 7"}'

the output on my system is as follows:

  PV         VG   Fmt  Attr PSize  PFree  Start SSize LV        
  /dev/sdb  data lvm2 a--  <9.79g <7.79g   100    21           
  /dev/sdb  data lvm2 a--  <9.79g <7.79g   512    38           
  /dev/sdb  data lvm2 a--  <9.79g <7.79g   571  1934           
  /dev/sdb  data lvm2 a--  <9.79g <7.79g     0   100 pgsql_data
  /dev/sdb  data lvm2 a--  <9.79g <7.79g   121   391 pgsql_data
  /dev/sdb  data lvm2 a--  <9.79g <7.79g   550    21 pgsql_data

this shows that we have only three ranges of PEs allocated to the "pgsql_data" LV:

  • 0-99 (100 PE),
  • 121-511 (391 PE)
  • 550-570 (21 PE)

Since we have 21 free PEs starting at extent 100, we can move the 21 PEs starting at 550 to fill the hole - the command is as follows:

pvmove --alloc anywhere /dev/sdb:550-570 /dev/sdb:100-120

the output on my system is:

  /dev/sdb: Moved: 4.76%
  /dev/sdb: Moved: 100.00%
Mind that we must supply the "--alloc anywhere" option since we are moving PEs within the same PV: when moving across different PVs this option is not necessary.

Let's look at the outcome:

pvs -o  +pvseg_all,lv_name /dev/sdb | awk 'NR == 1; NR > 1 {print $0 | "sort -k 9 -k 7"}'

now the output on my system is:

  PV         VG   Fmt  Attr PSize  PFree  Start SSize LV        
  /dev/sdb  data lvm2 a--  <9.79g <7.79g   512  1993           
  /dev/sdb  data lvm2 a--  <9.79g <7.79g     0   512 pgsql_data

we reached our goal: the hole disappeared, and now we have all PEs from 0 to 511 allocated without holes.

Working with LV

A Logical Volume is the storage device that is finally provided to the system. You can think of it as a partition that can be resized at will without having to worry about where the data is actually stored on the underlying hardware.

Creating LV

You can create a LV using the lvcreate command line utility. For example, to create the “foo” LV of 1GiB in size into the “data” VG type:

lvcreate -L 1GiB -n foo data

if you fancy, you can even create an LV specifying the size as a number of extents:

lvcreate -l10 -n bar data

or even as a percentage of the remaining free space.

For example, to create the "baz" LV in the "data" VG using 5% of the remaining available space of the VG, type:

lvcreate -l5%FREE -n baz data

If you need higher performance, you can even configure LVM to stripe writes across the PVs of the VG, providing with the "-i" parameter the number of PVs you want to stripe onto.

For example, if you want to have writes striped across two different PVs:

lvcreate -L512M -i2 -n qux data

the output is as follows:

  Using default stripesize 64.00 KiB.
  Logical volume "qux" created.

Note how this time it prints information about the default stripe size.

As you can easily guess, the striping settings are also viewable from the device mapper:

dmsetup deps /dev/mapper/data-qux

the output on my system is:

2 dependencies	: (8, 48) (8, 16)

we can of course get information on the striping from the LVM itself as follows:

lvdisplay data/qux -m

the output on my system is:

  --- Logical volume ---
  LV Path                /dev/data/qux
  LV Name                qux
  VG Name                data
  LV UUID                ZacQJN-GFJy-xsTv-4teq-0lM3-EUBm-l8xXgw
  LV Write Access        read/write
  LV Creation host, time localhost.localdomain, 2022-11-08 23:34:17 +0100
  LV Status              available
  # open                 0
  LV Size                512.00 MiB
  Current LE             128
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:3
   
  --- Segments ---
  Logical extents 0 to 127:
    Type		striped
    Stripes		2
    Stripe size		64.00 KiB
    Stripe 0:
      Physical volume	/dev/sdb
      Physical extents	636 to 699
    Stripe 1:
      Physical volume	/dev/sdd
      Physical extents	0 to 63

As you can see, it confirms that the stripe size is 64K – that is the default value.
You can obviously set the stripe size to a custom value more appropriate to your workload by providing it with the "-I" option.
For example, to set a stripe size of 128K on 3 PVs, type:

lvcreate -L512M -i3 -I 128 -n waldo data

let's check the outcome:

lvdisplay data/waldo -m

the output on my system is:

  --- Logical volume ---
  LV Path                /dev/data/waldo
  LV Name                waldo
  VG Name                data
  LV UUID                2pHhoL-3142-f3jf-nNCL-s3DP-SdKH-udZI3I
  LV Write Access        read/write
  LV Creation host, time localhost.localdomain, 2022-11-08 23:40:16 +0100
  LV Status              available
  # open                 0
  LV Size                516.00 MiB
  Current LE             129
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:4
   
  --- Segments ---
  Logical extents 0 to 128:
    Type		striped
    Stripes		3
    Stripe size		128.00 KiB
    Stripe 0:
      Physical volume	/dev/sdb
      Physical extents	700 to 742
    Stripe 1:
      Physical volume	/dev/sdd
      Physical extents	64 to 106
    Stripe 2:
      Physical volume	/dev/sde
      Physical extents	0 to 42

Renaming a LV

You may of course need to rename an existing LV - this can be accomplished using the lvrename command line utility. For example, to rename the "waldo" LV of the "data" VG to "pgsql_data", type:

lvrename data waldo pgsql_data

Resizing LV

Once an LV has been created, sooner or later you may need to resize it. This can be accomplished with the lvresize command. For example, to extend the "pgsql_data" LV to fill all the remaining available space of the "data" VG:

lvresize -l +100%FREE data/pgsql_data
Mind that lvresize supports the same options used by lvcreate to supply the size.
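
Also mind that resizing the LV does not resize the filesystem it contains: the filesystem must be grown separately. A minimal sketch for an XFS filesystem, assuming "data/pgsql_data" is mounted on the hypothetical "/srv/pgsql" mount point:

lvresize -l +100%FREE data/pgsql_data    # grow the LV with all the free space of the VG
xfs_growfs /srv/pgsql                    # then grow the mounted XFS filesystem to fill the LV

Alternatively, lvresize can take care of the filesystem as well when invoked with the "-r" (--resizefs) option.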

LV shrinking is supported, but it requires you to confirm that you know what you are doing.
For example:

lvresize -L512MiB data/foo

produces the following warning:

  THIS MAY DESTROY YOUR DATA (filesystem etc.)
  Do you really want to reduce data/foo? [y/n]:

This is just to remind you that what you are trying to do, although supported, can be dangerous.

Mind that not all filesystems support shrinking, and in any case you cannot shrink below the space actually used by the data.

Removing LV

You can get rid of LVs using the lvremove command line utility.

For example, to remove the "baz" LV from the "data" VG just type:

lvremove data/baz

Since this is a potentially harmful operation, it asks you to confirm what you are doing.

Remember that you cannot remove a LV with a mounted filesystem: you should unmount it first.

Thin-Pool Backed LVs

Modern storage systems enable you to over-provision storage, betting that users will not really need the whole provisioned space any time soon. LVM enables you to do this by means of thin pools: a thin pool lets you provision to LVs more space than is actually available in the VG - with thin provisioning, LEs are only reserved for the LV and get allocated only when data actually needs to be stored. This way you can reserve for an LV more LEs than the PEs actually assigned to the VG.

Storage overcommitting is a feature of most enterprise class storage hardware and hypervisors, so you do not need this if your storage hardware or hypervisor already provides thin-provisioning.

This feature is really useful when dealing with bare metal used to provide shared storage (NFS, CIFS, iSCSI) using low-cost commodity storage that does not natively support storage overcommitting.

An example scenario is GlusterFS: it is a shared, distributed, replicated file system designed to be installed on commodity hardware with directly attached storage: since the hardware must be as simple as possible (and so does not support storage over-provisioning), Gluster relies on LVM thin pools to implement it.

I worked with Gluster in the past, and in my experience it is suitable for use cases where you have to deal with a rather small number of very large files - for example as a store for VM data disks. In my experience, when dealing with a lot of small files, performance was really poor.

The thin pool is itself an LV specifically created for this purpose: it is used as the source of the LEs to be assigned to Thin-LVs. As an example, consider the following scenario:

  VG   #PV #LV #SN Attr   VSize  VFree 
  data   4   0   0 wz--n- 39.98g 39.98g

Creating a Thin-Pool LV

Let's now create the "fs_pool" LV of kind thin-pool, using 60% of the free space of the already existing "data" VG:

lvcreate -l60%FREE --thinpool fs_pool data

on my system the output is as follows:

  Thin pool volume with chunk size 64.00 KiB can address at most <15.88 TiB of data.
  Logical volume "fs_pool" created.
You cannot create an LV of kind thin-pool larger than the VG itself.

Let's check the outcome:

lvs

on my system the output is:

  LV        VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  bar       data -wi-a-----  40.00m                                                    
  foo       data -wi-a-----   1.00g                                                    
  qux       data -wi-a----- 512.00m                                                    
  fs_pool   data twi-a-tz--  23.94g              0.00   10.53 

an LV of kind thin-pool is quite special, since it leverages two hidden LVs: we can see them by providing the "-a" option to the lvs command line utility, so as to list all the available LVs, including the hidden ones:

lvs -a

on my system the output is as follows:

  LV                VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  bar               data -wi-a-----  40.00m                                                    
  foo               data -wi-a-----   1.00g                                                    
  [lvol0_pmspare]   data ewi-------  24.00m                                                    
  qux               data -wi-a----- 512.00m                                                    
  fs_pool           data twi-a-tz--  23.94g              0.00   10.53                           
  [fs_pool_tdata]   data Twi-ao----  23.94g                                                    
  [fs_pool_tmeta]   data ewi-ao----  24.00m

where:

  • "[fs_pool_tdata]" is the LV containing the actual data
  • "[fs_pool_tmeta]" holds the metadata

The metadata LV can be from 2MiB to 16GiB in size - you can manually specify its size using the "--poolmetadatasize" command option.

As an example, let's create another thin-pool specifying the size of the metadata LV:

lvcreate -l50%FREE --poolmetadatasize 3MiB --thinpool goldens_pool data

let's check the outcome:

lvs -a

on my system the output is as follows:

  LV                   VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  bar                  data -wi-a-----  40.00m                                                    
  foo                  data -wi-a-----   1.00g                                                    
  [lvol0_pmspare]      data ewi-------  24.00m                                                    
  qux                  data -wi-a----- 512.00m                                                    
  fs_pool              data twi-a-tz--  23.94g              0.00   10.53                           
  [fs_pool_tdata]      data Twi-ao----  23.94g                                                    
  [fs_pool_tmeta]      data ewi-ao----  24.00m
  goldens_pool         data twi-a-tz--  <7.22g              0.00   11.13                           
  [goldens_pool_tdata] data Twi-ao----  <7.22g                                                         
  [goldens_pool_tmeta] data ewi-ao----   4.00m 

As you can see, "goldens_pool_tmeta" has been rounded up to 4MiB: this is because the PE size of the VG is 4MiB.

Creating a Thin-Pool backed LV

Now, to test the "fs_pool" thin-pool, let's create a thin provisioned LV backed by it:

lvcreate -V 15G --thin -n shared_fs data/fs_pool

Please note the usage of the "--thin" option and that this time we specify a thin-pool LV (data/fs_pool) as the target rather than just a VG.

Let's see the outcome:

lvs

the output is as follows:

  LV            VG   Attr       LSize   Pool   Origin  Data%  Meta%  Move Log Cpy%Sync Convert
  bar           data -wi-a-----  40.00m                                                      
  foo           data -wi-a-----   1.00g                                                      
  fs_pool       data twi-aotz--  23.94g                0.00   10.55                           
  qux           data -wi-a----- 512.00m                                                      
  shared_fs     data Vwi-a-tz--  15.00g fs_pool        0.00                                   
  goldens_pool  data twi-a-tz--  <7.22g                0.00   11.13

As you can see, we have provisioned 15G to the "shared_fs" LV even though none of that space has actually been allocated from the "fs_pool" thin pool yet: extents will be consumed only when data gets written.

From time to time, you can check the percentage usage of the pool data and metadata simply by typing "lvs", as follows:

  LV            VG   Attr       LSize   Pool   Origin  Data%  Meta%  Move Log Cpy%Sync Convert
  bar           data -wi-a-----  40.00m                                                      
  foo           data -wi-a-----   1.00g                                                      
  fs_pool       data twi-aotz--  23.94g                39.05   26.30                           
  qux           data -wi-a----- 512.00m                                                      
  shared_fs     data Vwi-a-tz--  15.00g fs_pool        5.14                                   
  goldens_pool  data twi-a-tz--  <7.22g                0.00   11.13

Resizing a Thin-Pool LV

Since the thinpool is actually an LV, if you run out of space you can extend it the same way you would do with a regular LV:

lvresize -L +512MiB data/fs_pool

Of course it may also happen that you run out of space in the metadata pool: you can extend it as well.

For example, to add another 40MiB to the metadata LV of the "fs_pool" thin pool of the "data" VG, type:

lvresize -L +40MiB data/fs_pool_tmeta

Resizing a Thin-Pool backed LV

You can of course also resize any thin-pool backed LV. As an example, let's resize the "shared_fs" LV we've just created, adding another 40GiB:

lvresize -L+40GiB data/shared_fs

this is the output on my system:

  Size of logical volume data/shared_fs changed from 15.00 GiB (3840 extents) to 55.00 GiB (14080 extents).
  WARNING: Sum of all thin volume sizes (55.00 GiB) exceeds the size of thin pools and the size of whole volume group (39.98 GiB).
  WARNING: You have not turned on protection against thin pools running out of space.
  WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
  Logical volume data/shared_fs successfully resized.

Since I'm over-provisioning the storage, LVM emitted the above warnings. 

Auto-extending Thin-Pool LVs

As suggested by the above warnings, if you fancy you can enable the automatic extension of thin pools: this is a handy feature that automatically extends the thin-pool LVs when needed.

To configure it, look into the "/etc/lvm/lvm.conf" file and uncomment and modify the following settings as needed:

thin_pool_autoextend_threshold = 100
thin_pool_autoextend_percent = 20

since the default threshold is 100% it never gets hit.

To enable it you must lower the threshold: for example, if you set it to 80, as soon as any thin pool gets filled to 80% of its capacity it gets automatically extended by 20%.
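
For example, with a threshold of 80% the relevant lines in "/etc/lvm/lvm.conf" would look like this:

thin_pool_autoextend_threshold = 80
thin_pool_autoextend_percent = 20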

Snapshots

LVM snapshots leverage Copy-on-Write (CoW) to provide a consistent view of the data at a given moment. They are really useful, for example, when dealing with backups, to ensure data consistency during the backup. Another typical use case is having a small time window to perform actions that modify the file system, with the opportunity to roll back if necessary.

Rolling back a snapshot causes a service outage: the filesystem contained in the LV you want to restore must indeed be unmounted before you can proceed with the rollback. Take this into account if you decide to use snapshots for this kind of use case.

Mind that LVM snapshots cannot be used for long-term storage: each time something gets modified, the original data is copied to the designated LV, which eventually runs out of space. In addition, take into account that CoW does not come for free - it has a performance cost, since you are basically reading the old data from the original LV, writing it to the snapshot LV and eventually writing the new data to the original LV. For these reasons the lifetime of a snapshot should be as short as possible, and should not last longer than the purpose it was taken for: if we took it to run a backup, we should remove it as soon as the backup ends.

Since snapshots are LVs, you can grow or shrink them using lvresize.
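
For example, to add 256MiB to the backing LV of a snapshot (using the "qux_snap" snapshot created in the next paragraph as a reference), you would type:

lvresize -L +256M data/qux_snap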

Taking a snapshot

A snapshot is taken using the lvcreate command, providing the "-s" option and the path to the LV that is the subject of the snapshot.

For example, to take a snapshot called "qux_snap" of the "qux" LV belonging to the "data" VG:

lvcreate -L 512M -s -n qux_snap data/qux

when creating a snapshot LV it is mandatory to specify the size using the -L parameter. The outcome is a new LV used to store the CoW data. We can also take a snapshot of a thin-pool backed LV. For example, to take a snapshot of the "shared_fs" LV we previously created:

lvcreate -L 512M -s -n shared_fs_snap data/shared_fs

Anyway, when dealing with a thin-pool backed LV, you can also omit the size, thereby creating the snapshot LV itself as a thin-pool backed LV:

lvcreate -s --name thin_snap data/shared_fs

since snapshots are actually LVs, you can list them using lvs as usual:

lvs

on my system the output is as follows:

  LV                 VG   Attr       LSize   Pool    Origin    Data%  Meta%  Move Log Cpy%Sync Convert
  bar                data -wi-a-----  40.00m                                                         
  foo                data -wi-a-----   1.00g                                                         
  fs_pool            data twi-aotz-- <24.19g                   0.00   10.12                           
  qux                data -wi-a----- 512.00m
  qux_snap           data swi-a-s--- 512.00m         qux       0.00                                                                                               
  shared_fs          data owi-a-tz--  15.00g fs_pool           0.00                                   
  shared_fs_snap     data swi-a-s--- 512.00m         shared_fs 0.00                                                           
  thin_snap          data Vwi---tz-k  15.00g fs_pool shared_fs
  goldens_pool       data twi-a-tz--  <7.22g                   0.00   11.13

As you can see, you can easily guess which LVs are snapshots: they have an Origin.

In the above list:

  • the "qux_snap" LV is a snapshot taken from the "qux" LV.
  • the "shared_fs_snap" LV is a snapshot taken from the "shared_fs"  thin-pool backed LV ("fs_pool").
  • the "thin_snap" thin-pool backed LV ("fs_pool") is a snapshot taken from the "shared_fs"  thin-pool backed LV.

Now remove the "thin_snap" LV  as follows:

lvremove data/thin_snap

we can verify the consumption of the snapshot using the lvs command as usual:

lvs

As soon as data in the "shared_fs" LV is modified, the original data is copied to "shared_fs_snap", increasing the value of the "Data%" column:

  LV                 VG   Attr       LSize   Pool   Origin    Data%  Meta%  Move Log Cpy%Sync Convert
  bar                data -wi-a-----  40.00m                                                         
  foo                data -wi-a-----   1.00g                                                         
  fs_pool            data twi-aotz-- <24.19g                   0.00   10.12                           
  qux                data -wi-a----- 512.00m
  qux_snap           data swi-a-s--- 512.00m         qux       0.00                                                                                               
  shared_fs          data owi-a-tz--  15.00g fs_pool           0.00                                   
  shared_fs_snap     data swi-a-s--- 512.00m         shared_fs 3.42
  goldens_pool       data twi-a-tz--  <7.22g                   0.00   11.13
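
Such an increase can easily be triggered by writing to the origin LV; a hypothetical test, assuming "shared_fs" has been formatted and mounted on "/srv/nfs/shared_fs" as shown later in this post:

dd if=/dev/zero of=/srv/nfs/shared_fs/testfile bs=1M count=100   # write 100MiB of data to the origin LV
lvs data/shared_fs_snap                                          # then check how much snapshot space got consumed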

As you can easily guess, when it reaches 100% there is no more space left to copy data from the Origin to the snapshot, and the snapshot becomes unusable.

When the backing LV becomes full, the snapshot gets invalidated. This means that you won't be able to use it anymore - if you try to roll it back you get the following message:

Unable to merge invalidated snapshot LV data/shared_fs_snap.

Auto-extending Snapshots

In the same way as with thin pools, a very handy feature you can enable if necessary is the auto-extension of the backing LVs used by snapshots: look into "/etc/lvm/lvm.conf" and locate the following settings:

snapshot_autoextend_threshold = 100
snapshot_autoextend_percent = 20

Here too, since the default threshold is 100%, it never gets hit.

To enable the auto-extend feature, you need to lower the threshold: for example, if you set "snapshot_autoextend_threshold" to 80, as soon as a snapshot gets filled to 80% it gets extended by 20%.
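
For example, with a threshold of 80% the settings would look like this:

snapshot_autoextend_threshold = 80
snapshot_autoextend_percent = 20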

Rolling back a snapshot

Rolling back a snapshot requires, as a first step, unmounting the filesystem of the LV we want to restore: for example, if the "shared_fs" LV is mounted on "/srv/nfs/shared_fs", type:

umount /srv/nfs/shared_fs

once unmounted we can safely proceed with the rollback statement:

lvconvert --merge data/shared_fs_snap

the output is as follows:

  Merging of volume data/shared_fs_snap started.
  data/shared_fs: Merged: 95.39%
  data/shared_fs: Merged: 100.00%

Mind that when the rollback completes - that is, after merging the backing LV into its Origin - the backing LV (shared_fs_snap) gets automatically removed.

Removing a snapshot

Since the backing LV of the snapshot is actually an LV, you can remove it like any regular LV, as follows:

lvremove data/shared_fs_snap

External-Origin snapshots

External-Origin snapshots are snapshots that work the opposite way of regular ones: when using regular snapshots, you mount the snapshotted LV (the Origin), so when writing you are actually modifying its data, and the original data gets copied on write (CoW) to the backing LV.

External-Origin snapshots are instead set up by making the Origin LV read-only, so modified data is written directly to the backing LV. The most straightforward consequence of this is that you can take multiple snapshots of the same origin, and since these snapshots are taken from a read-only LV they always start from the state the Origin was in when it was set read-only.

Unlike regular snapshots, external-origin snapshots are long-lasting snapshots, meaning that you can always roll back to their original state: with this kind of snapshot you have the opposite problem: you must provide enough space to the backing LV to prevent it from running out of space.

A typical use of this kind of snapshot is creating a golden image used as the base to spawn several instances: the golden image must be immutable - this is why the origin LV is set read-only - whereas each instance gets a read-write backing LV it can consume as an LVM overlay. It is straightforward that you cannot merge the backing LV into the Origin LV, otherwise you would invalidate any other snapshot relying on the same origin.

If you try to create an external-origin snapshot without having previously set the LV inactive and read-only, you'll get the following error:

Cannot use writable LV as the external origin.

For example, if the golden image is the "base" LV of the "data" VG, set it inactive and read-only as follows:

lvchange -a n -p r data/base

You can now take as many snapshots as you want:

lvcreate -s --thinpool data/goldens_pool base --name base_clone_1
lvcreate -s --thinpool data/goldens_pool base --name base_clone_2

Mind that when dealing with thin provisioned LVs the commands are exactly the same.

Let's check the outcome:

lvs

the output on my system is as follows:

  LV             VG   Attr       LSize   Pool         Origin Data%  Meta%  Move Log Cpy%Sync Convert
  bar            data -wi-a-----  40.00m                                                        
  foo            data -wi-a-----   1.00g                                                        
  fs_pool        data twi-aotz-- <24.19g                      0.00   10.12                           
  base           data ori------- 512.00m                                                        
  base_clone_1   data Vwi-a-tz-- 512.00m goldens_pool base    0.00
  base_clone_2   data Vwi-a-tz-- 512.00m goldens_pool base    0.00                                   
  shared_fs      data Vwi-a-tz--  55.00g fs_pool              0.00                                   
  goldens_pool   data twi-aotz--  <6.76g                      0.00   10.79  

The "base_clone_1" and "base_clone_2" LVs can now be used independently, at will.

Using The Logical Volume

Once created, LVs are provisioned by the Device Mapper beneath "/dev" as special device files named "dm-<number>". The Device Mapper also creates a convenient symlink beneath "/dev/mapper" that contains both the VG and LV names.

For example the "bar" LV of the "foo" VG generates the "/dev/mapper/foo-bar" symlink.
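
You can verify this with a simple ls; for example, for the "data-shared_fs" mapping used in this post (the actual "dm-<number>" target differs from system to system):

ls -l /dev/mapper/data-shared_fs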

You can then rely on that symlink to format the LV.

For example:

mkfs.xfs -L sharedfs /dev/mapper/data-shared_fs

the same symlink can of course be used as the source device to mount the filesystem.

For example, type the following statements:

mkdir /srv/nfs/shared_fs
mount /dev/mapper/data-shared_fs /srv/nfs/shared_fs

If instead you need to permanently mount it, so that it gets automatically remounted when booting, add a line like the following one into the "/etc/fstab" file:

/dev/mapper/data-shared_fs   /srv/nfs/shared_fs  xfs     defaults        0 0
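
After adding the entry you can verify it without rebooting: mount -a mounts everything listed in "/etc/fstab" that is not already mounted, so any typo in the new line shows up immediately:

mount -a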

Footnotes

Here ends this post about using LVM: we thoroughly saw how to operate it professionally in almost every scenario you are likely to find. In addition to that, I'm pretty sure you have now also got the necessary skills to autonomously investigate more complex setups.

Writing a post like this takes a lot of hours. I'm doing it for the sole pleasure of sharing knowledge and thoughts, but all of this does not come for free: it is a time consuming volunteering task. This blog is not affiliated with anybody, does not show advertisements nor sell data of visitors. The only goal of this blog is to make ideas flow. So please, if you liked this post, spend a little of your time to share it on LinkedIn or Twitter using the buttons below: seeing that posts are actually read is the only way I have to understand if I'm really sharing thoughts or if I'm just wasting time and I'd better quit.
