POSIX compliant filesystems

Posted on May 3, 2021 (updated November 14, 2024) by Marco Antonio Carcano

Don’t be tempted to skip this post: you would miss something valuable. Most of us know how to operate a filesystem, but the underlying details of POSIX filesystems are not widely known. In this post I describe them fairly accurately, trying to stay at a level that may intrigue you without becoming too theoretical. This kind of expertise is certainly one of the things that distinguish a skilled professional from a technician. In addition, it may really save your day when you face the weird things that sometimes arise.

Elements of a POSIX filesystem

Directories, inodes and data blocks are the basic elements of a POSIX compliant filesystem. Inodes are objects you are probably already acquainted with, so we begin our tour with them.

Inodes

An inode is a data structure that holds the metadata of a filesystem object. There are a few different types of inodes (file, directory, socket, special device); they provide information such as:

a timestamp of when the inode was last modified (the ctime)
a timestamp of when the file's contents were last modified (the mtime)
a timestamp of when the file was last accessed (the atime), for example by read() or exec()
the numeric user id (UID) of the user that is the owner of the backed data
the numeric group id (GID) of the group that is the owner of the backed data
the POSIX mode bits (often called permissions or permission bits )
...

The inode's metadata also provides the information needed to locate the data blocks belonging to the file. Each inode is an entry of the inode table. On filesystems that use a fixed-size inode table, such as those of the ext family, the table is pre-allocated at format time to hold a given number of entries, that is, inodes. That number is the upper limit on the objects you can create in the filesystem, whether they are files or directories. (Some filesystems, XFS among them, allocate inodes dynamically instead.)
Let's see this in action: using the stat command we can get information about the inode linked to the directory entry called "bashrc" in the "/etc" directory:

cd /etc
stat bashrc

the outcome is:

  File: bashrc
  Size: 3001      	Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d	Inode: 157687      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:etc_t:s0
Access: 2020-03-21 13:14:18.597717832 +0000
Modify: 2018-09-10 11:51:03.000000000 +0000
Change: 2019-10-25 18:37:43.962000000 +0000
 Birth: -

The output shows that this is the inode of a regular file, with inode number 157687.

Although the output shows the file name, it is printed only for the convenience of the user: this information is not stored in the inode.
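The individual fields can also be queried one at a time. The following is only a sketch: it assumes GNU coreutils stat (the -c format option) and works on a throwaway file in a temporary directory (workdir and demo.txt are illustrative names). It also shows that chmod updates the ctime while leaving the mtime alone:

```shell
# Assumes GNU coreutils; workdir and demo.txt are illustrative names.
workdir=$(mktemp -d)
echo "sample" > "$workdir/demo.txt"

# %i = inode number, %h = link count, %a = octal mode, %s = size in bytes
inode=$(stat -c '%i' "$workdir/demo.txt")
links=$(stat -c '%h' "$workdir/demo.txt")
mode=$(stat -c '%a' "$workdir/demo.txt")
size=$(stat -c '%s' "$workdir/demo.txt")
echo "inode=$inode links=$links mode=$mode size=$size"

# %Y = mtime, %Z = ctime, both in seconds since the epoch
mtime_before=$(stat -c '%Y' "$workdir/demo.txt")
ctime_before=$(stat -c '%Z' "$workdir/demo.txt")
sleep 2
chmod 600 "$workdir/demo.txt"    # changes metadata only, not the contents
mtime_after=$(stat -c '%Y' "$workdir/demo.txt")
ctime_after=$(stat -c '%Z' "$workdir/demo.txt")
echo "mtime: $mtime_before -> $mtime_after"
echo "ctime: $ctime_before -> $ctime_after"

rm -rf "$workdir"
```

The mtime is expected to stay the same, because the contents were not touched, whereas the ctime moves forward, because the mode bits live in the inode.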

File

A file is a set of data blocks used to permanently store contents, such as documents, images and so on. Note that under the hood a file does not have a name: it is only a logical object made of a set of blocks that are referenced either directly from the inode or indirectly through another block. The name of the file is instead stored inside a directory, as a directory entry.

File size and size on disk

We can now start talking about the size of files. They actually have two sizes:

the length of the data in the file (let's call it simply “size”)
the actual size used on disk

Most of the time they differ. Let's look, for example, at the sizes of the /etc/passwd file (-l shows the size, -s the size on disk, -h prints both in human-readable form):

ls -lsh passwd
4.0K -rw-r--r--. 1 root root 1.8K Mar 11 20:39 passwd

The actual length of the data in the file is 1.8 KB: that is how far you have to seek from the Beginning Of File (BOF) to get to the End Of File (EOF). It is right to think of it as the actual data size, whereas 4 KB is the actual size used on disk.

By the way, this latter figure is the same one reported by the du utility:

du -sh passwd
4.0K	passwd

File size and size on disk differ because the unit of disk space allocation is the block, and in this example the underlying filesystem has been created with a 4 KB block size. On XFS filesystems we can verify this as follows:

xfs_info / | grep  "^data[ ]*=[ ]*" | sed 's/.*\(bsize=[0-9]*\).*/\1/'
bsize=4096

So in this example the file owns a whole block of 4KB of data, but the length of the data from BOF to EOF is only 1.8KB.
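The same two figures can be read directly from the inode. This is a minimal sketch assuming GNU stat, where %s is the length of the data, %b the number of allocated blocks and %B the size in bytes of each block counted by %b (workdir and foo are illustrative names):

```shell
# Assumes GNU coreutils; workdir and foo are illustrative names.
workdir=$(mktemp -d)
printf 'Hello world\n' > "$workdir/foo"
sync    # flush pending writes so the block count is settled

size=$(stat -c '%s' "$workdir/foo")      # length from BOF to EOF
blocks=$(stat -c '%b' "$workdir/foo")    # allocated blocks
unit=$(stat -c '%B' "$workdir/foo")      # bytes per block counted by %b
on_disk=$((blocks * unit))
echo "size=$size bytes, on disk=$on_disk bytes ($blocks blocks of $unit bytes)"

rm -rf "$workdir"
```

Note that the unit reported by %B is usually 512 bytes regardless of the filesystem block size, and that some filesystems may store very small files inside the inode itself, in which case the size on disk can even be reported as 0.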

You may be tempted to conclude that the size of a file is always less than or equal to its size on disk (that is, the sum of its allocated blocks). This is so most of the time, but it is not always true.

Sparse files

If the underlying filesystem supports them (as most Unix filesystems do), sparse files are this exception. They differ from "regular" files in the way the underlying blocks on the disk are allocated: a sparse file declares a given length from BOF to EOF (so it declares a data size) but leaves the block references for the unwritten ranges null.
We can see this in action using the dd command line utility as follows:

dd if=/dev/zero of=/tmp/disk.img bs=1 count=0 seek=1G

It creates an empty file of 1 GiB (seek=1G places EOF 1 GiB away from BOF) without writing any block (count=0): no block is allocated along the whole length, so all the block references are null.

If the underlying filesystem supports sparse files, it does not allocate the backing disk blocks, leaving the file at 0 size on disk:

ls -lsh /tmp/disk.img
0 -rw-r--r--. 1 macarcano macarcano 1.0G Mar 21 14:12 /tmp/disk.img

du reports the same information:

du -sh  /tmp/disk.img 
0	/tmp/disk.img

Disk blocks are actually allocated only when needed, that is when something actually writes into the file.
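Allocation on demand can be observed without root privileges. The sketch below assumes GNU coreutils and a filesystem with sparse file support; the file name and the sizes are arbitrary. It creates a 64 MiB sparse file, then writes a single 4 KiB block in the middle of it and compares the allocated space before and after:

```shell
# Assumes GNU coreutils and sparse file support; names and sizes are arbitrary.
workdir=$(mktemp -d)
img="$workdir/disk.img"

truncate -s 64M "$img"
apparent=$(stat -c '%s' "$img")
allocated=$(( $(stat -c '%b' "$img") * $(stat -c '%B' "$img") ))
echo "before write: apparent=$apparent allocated=$allocated"

# Write one 4 KiB block at offset 1 MiB, without truncating the file
dd if=/dev/urandom of="$img" bs=4096 count=1 seek=256 conv=notrunc status=none
sync
apparent_after=$(stat -c '%s' "$img")
allocated_after=$(( $(stat -c '%b' "$img") * $(stat -c '%B' "$img") ))
echo "after write:  apparent=$apparent_after allocated=$allocated_after"

rm -rf "$workdir"
```

The declared size stays at 64 MiB in both cases, while the allocated space is expected to grow only by roughly the amount of data actually written.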

Now let's remove the sparse file and go on:

rm -f /tmp/disk.img

Let's create a smaller sparse file to play with. This time we create it using the truncate utility:

cd /tmp
mkdir sparse
truncate -s 16M sparse/disk.img

Now let's format it using a data block size smaller than the default one and then mount it:

mkfs.xfs -b size=1024 sparse/disk.img
sudo mount -o loop sparse/disk.img /mnt
sudo chmod 777 /mnt

Now let's store a 12-byte string (11 characters plus the trailing newline) into a file:

echo "Hello world" > /mnt/foo

let's check the sizes:

ls -lsh /mnt/foo
1.0K -rw-rw-r--. 1 macarcano macarcano 12 Mar 21 14:28 /mnt/foo

Everything is as expected: the size is 12 bytes, whereas the size on disk is 1K, which matches the block size we picked at formatting time. Indeed:

xfs_info /mnt | grep  "^data[ ]*=[ ]*" | sed 's/.*\(bsize=[0-9]*\).*/\1/'
bsize=1024

Now create a 2M file of random data inside it and unmount the filesystem

dd if=/dev/urandom of=/mnt/bar.dat bs=1M count=2
sudo umount /mnt

Note how ls displays both the size on disk (5.7M) and declared size (16M):

ls -lsh sparse/disk.img
5.7M -rw-rw-r--. 1 macarcano macarcano 16M Mar 21 14:30 sparse/disk.img

Sparse files are generally used to store loop filesystems. Many virtualization solutions rely on sparse files to create thin-provisioned VM disks. Their usage should be carefully evaluated, since they tend to become fragmented.

You can move a sparse file from one location to another within the same filesystem using mv. When source and destination are on the same filesystem, mv does not touch the inode, so it does not alter the file size: it simply removes the directory entry from the source directory and adds a new entry to the destination directory.

ls -i sparse
18636774 disk.img
mv sparse/disk.img sparse/disk2.img
ls -i sparse
18636774 disk2.img

The copy (cp) command supports sparse files, but you should supply the --sparse option, for example:

mkdir backup
cp --sparse=always sparse/disk2.img backup/disk3.img
ls -lsh backup/disk3.img 
4.1M -rw-rw-r--. 1 macarcano macarcano 16M Mar 21 15:20 backup/disk3.img

Note how the size on disk changed (4.1M), but it is still lower than the declared size (16M).
Since we are human beings, we may by mistake copy a sparse file like a regular one, omitting the option that avoids allocating backing blocks for unused data:

cp sparse/disk2.img backup/disk4.img
ls -lsh backup/disk4.img 
16M -rw-rw-r--. 1 macarcano macarcano 16M Mar 21 15:21 backup/disk4.img

We can easily recover from an error like this by running the fallocate utility with the --dig-holes option on the destination file:

fallocate --dig-holes  backup/disk4.img

Let's check the size again: it should drop back to about the size on disk of the sparse copy we made before:

ls -lsh backup/disk4.img
4.2M -rw-rw-r--. 1 macarcano macarcano 16M Mar 21 15:20 backup/disk4.img
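To convince yourself that a sparse copy loses nothing, you can compare the copy with its source. This is a sketch under the assumption of GNU coreutils and a filesystem that supports sparse files; src.img and copy.img are made-up names:

```shell
# Assumes GNU coreutils and sparse file support; file names are made up.
workdir=$(mktemp -d)

# 1 MiB of random data followed by a 15 MiB hole
dd if=/dev/urandom of="$workdir/src.img" bs=1M count=1 status=none
truncate -s 16M "$workdir/src.img"

# Copy it, turning runs of zero bytes into holes in the destination
cp --sparse=always "$workdir/src.img" "$workdir/copy.img"

# The contents must be byte-for-byte identical
if cmp -s "$workdir/src.img" "$workdir/copy.img"; then same=yes; else same=no; fi

copy_size=$(stat -c '%s' "$workdir/copy.img")
copy_alloc=$(( $(stat -c '%b' "$workdir/copy.img") * $(stat -c '%B' "$workdir/copy.img") ))
echo "identical=$same declared=$copy_size allocated=$copy_alloc"

rm -rf "$workdir"
```

Holes read back as zero bytes, which is why cmp cannot tell a hole from a block full of zeroes: only the allocated space differs.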

When dealing with a lot of files or with remote backups, you can keep things in sync using rsync with the -S (sparse) option:

rsync -S backup/disk4.img backupuser@remote:/home/macarcano

Let's check the file sizes on the remote server:

ls -lash /home/macarcano/disk4.img 
8.0M -rw-rw-r--. 1 backupuser backupuser 16M Mar 21 15:23 disk4.img

As you see, the archived file is still a sparse file, although it consumes more space: 8.0M, whereas the original one took only 4.2M.

Sparse file support in rsync does not seem very efficient, at least from the space consumption perspective.

Orphaned inodes

Orphaned inodes are inodes that are still marked as used and bound to data blocks on the disk, but with no more directory entries linked to them.

On EXT filesystems the fsck utility recovers them and puts them into the lost+found directory. But there are situations when you can neither shut down the system nor unmount the filesystem to run fsck and perform the recovery.

Let's see a real-life example that happened years ago. On a system that was running out of disk space, somebody archived (copied and deleted) a log file while a service was still writing to it. The outcome was that the inode of the file became orphaned: still there (consuming disk space), still receiving log messages and, even worse, still growing. Reloading the service would have meant losing all the messages logged after the file had been copied, which was unacceptable. Eventually I got the idea of reading the data through the file descriptor that was still open into a new archive file, then truncating the file through the same descriptor and reloading the service. It worked.
As an example, let's say there is a running nginx service instance that, among other files, is logging into /var/log/nginx/error.log, and that this log file has been deleted. We can use the lsof utility to search for open files marked as deleted:

lsof | grep nginx | grep "(deleted)"

the output should look as follows:

nginx     1121                    root    2w      REG                8,1        0   25527982 /var/log/nginx/error.log (deleted)

Note the PID (1121) and the file descriptor: the FD column reads 2w, that is descriptor 2 opened for writing (8,1 is the device number, not the descriptor). Now, by exploiting the /proc filesystem, we can read from the open file descriptor into a compressed archive, truncate the file through the descriptor and reload the service.

In the above example it looks like this:

gzip -c < /proc/1121/fd/2 > /tmp/error.log.gz
truncate -s 0 /proc/1121/fd/2
systemctl reload nginx

As Pedro pointed out in comments "This is Unix classic behavior in action. You can keep writing on file descriptor after the unlink(2) ... Take a look at the old classic book by Maurice Bach 'The Unix Operating System".
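The behaviour can be reproduced safely in a scratch directory, with the shell itself playing the role of the service. This is only a sketch: it assumes Linux with /proc mounted, and the file name app.log and descriptor number 3 are arbitrary choices:

```shell
# Assumes Linux with /proc mounted; app.log and fd 3 are arbitrary choices.
workdir=$(mktemp -d)
echo "first line" > "$workdir/app.log"

exec 3>>"$workdir/app.log"    # keep a descriptor open for appending
rm "$workdir/app.log"         # unlink: no directory entry is left
echo "second line" >&3        # the orphaned inode still accepts writes

# cat inherits descriptor 3, so /proc/self/fd/3 leads to the orphaned inode
recovered=$(cat /proc/self/fd/3)
echo "$recovered"

exec 3>&-                     # closing the last descriptor frees the inode
rm -rf "$workdir"
```

Both lines are expected to be recovered, even though the second one was written after the file had been deleted.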

Special File types available in POSIX filesystems

Inodes can also be used to back objects that are not real files or directories, giving the user the feeling that those objects are files. Very often I hear the famous sentence "on UNIX everything is a file". This is not the right way of saying it: the truth is that "on UNIX almost everything is available as a file". We are about to discover why.

Device Files

These are special files that provide direct access to devices, such as disks, keyboards, mice and so on. There are two kinds of device files:

device-type | kind | description
c | character device | a special type of file that provides direct access to character based devices such as keyboards. An example character device is the teletype (tty)
b | block device | a special type of file that provides direct access to block based devices such as disks. Example block devices are the SCSI disk (sd) and the IDE disk (hd)

Device files are created using the mknod utility. The syntax is as follows:

mknod device-name device-type major-number minor-number

major number identifies the device driver used to access the device
minor number is specific to the device driver implementation

for example, to create a block device file:

sudo mknod /dev/loop1 b 7 0

to create a character device file:

sudo mknod /dev/another-random c 1 8

Note that historically each of the two numbers was stored in a single byte, so the range was 0-255; since Linux 2.6 the device number is 32 bits wide, with 12 bits for the major number and 20 bits for the minor number. The list of major and minor numbers used by the Linux kernel is available at https://www.kernel.org/doc/Documentation/admin-guide/devices.txt

Many of the low major numbers have already been reserved. If you are developing your own device driver and you need a major number, you can request major 0 when registering the driver: this causes the kernel to pick a major number that is still unused on your system.
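Major and minor numbers of an existing device file can be read back without root privileges. The sketch below assumes Linux and GNU stat, whose %t and %T format sequences print the two numbers in hexadecimal; /dev/null is used only because it exists everywhere:

```shell
# Assumes Linux and GNU coreutils stat.
kind=$(stat -c '%F' /dev/null)             # file type as a string
major=$((0x$(stat -c '%t' /dev/null)))     # %t = major number, in hex
minor=$((0x$(stat -c '%T' /dev/null)))     # %T = minor number, in hex
echo "/dev/null: $kind, major=$major minor=$minor"

# ls -l shows the same pair in place of the file size
ls -l /dev/null
```

On Linux /dev/null is the character device 1,3: major 1 is the same "memory devices" driver used by the mknod example above, where minor 8 is the one listed for /dev/random in devices.txt.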

Named pipe

Pipes are one-way channels used by processes to exchange data. They are part of the Inter Process Communication subsystem. Unlike traditional pipes, which do not have a name, named pipes are special files that a process can open to send a stream to, or read a stream from: they are a convenient way to let one process stream data to another.

As Pedro pointed out in comments "Unnamed pipes (ls | more) allocate an inode in it short lifespan".

Named pipes can be created with either the mknod or the mkfifo command line utility.

For example, the statement:

mkfifo -m 0666 foopipe

creates a file called foopipe in the current directory. We can then see the outcome:

ll foopipe

the output is as follows:

prw-r--r-- 1  macarcano  macarcano  0  Sep  15  12:21   foopipe

An example use case is backups. Years ago I often exploited named pipes to back up databases without needing a dedicated disk to temporarily store the backup files before the backup utility moved them to tape: for example, I had mysqldump write to the named pipe and the Bacula agent read from that pipe.
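The producer/consumer pattern described above can be sketched in a few lines. Here a background subshell stands in for the program producing the backup and cat stands in for the backup agent; the pipe name and the payload are placeholders, and GNU stat is assumed for the %F format:

```shell
# Assumes GNU coreutils; foopipe and the payload are placeholders.
workdir=$(mktemp -d)
mkfifo "$workdir/foopipe"

# Producer: blocks until a reader opens the pipe, then streams its data
( echo "backup payload" > "$workdir/foopipe" ) &

# Consumer: reads the stream; nothing is ever stored on disk
received=$(cat "$workdir/foopipe")
wait

pipe_type=$(stat -c '%F' "$workdir/foopipe")
pipe_size=$(stat -c '%s' "$workdir/foopipe")
echo "received='$received' type=$pipe_type size=$pipe_size"

rm -rf "$workdir"
```

The pipe has an inode and a directory entry, but its size stays at 0: the data only flows through kernel memory from the writer to the reader.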

UNIX Domain Socket

It is an IPC facility that provides a communication endpoint for exchanging data between processes. There are three kinds of UNIX domain sockets:

unnamed - these sockets are created to exchange data, but do not have a name
abstract - these sockets have a name that must be unique inside an abstract namespace. Utilities such as lsof and fuser display this kind of socket with a name prefixed by the at (@) character
pathname - these sockets are bound to an inode and so are exposed through the filesystem

Most readers probably already know the latter. I won't say anything about the first two to avoid going off-topic, but for the sake of completeness I wanted to mention them.

Let's create a pathname UNIX domain socket to play with - we need to install netcat first:

sudo dnf install -y netcat

let's run the first netcat instance, that acts like a server:

sudo nc -l -N -U /var/run/foo.socket

the above statement launches netcat in listening mode on /var/run/foo.socket. Open another terminal and spawn a client connected to the same socket:

sudo nc -N -U /var/run/foo.socket

on the terminal running the client, type "Hello, this is the client" and hit enter: you'll get it echoed on the terminal running the server.

Hello, this is the client

Now, on the terminal running the server, type "Hi dear, nice to meet you" and hit enter: it will be echoed on the client terminal.

Hi dear, nice to meet you

On the terminal running the server, press CTRL+C to terminate it: the client will be immediately disconnected.

Note that the UNIX domain socket file opened by netcat does not get removed:

ls -al /var/run/foo.socket
srwxr-xr-x. 1 root root 0 May  3 19:07 /var/run/foo.socket

It should be explicitly unlinked.

sudo rm /var/run/foo.socket

Directory

A directory is a special object whose on-disk layout depends on the filesystem: each implementation has its own data model, ranging from a simple list of rows to a B+Tree.

What is common among all filesystem implementations is that:

each entry should at least have a logical name along with the number of the inode that contains the metadata backing up the entry
the name can be any string of characters except the "/" character and the "null" character (a zero byte) - of course the length of the string is limited too, but the actual limit depends on the filesystem implementation
the name is the primary key: you cannot have two entries with the same name in the same directory

The fact that the name is the primary key is really easy to verify: let's try to create a directory with the same name as an already existing file:

cd /etc
mkdir bashrc

the outcome is:

mkdir: cannot create directory 'bashrc': File exists

It obviously fails: the directory object named "etc", a direct child of the root directory (/), already has an entry called "bashrc".

We can use some common utilities to guess the structure of a directory under the hood.

Let's change to the root directory (/)

cd /

We can use the -i option to see the inode number of each entry of the directory, along with the -a option, which also shows hidden entries:

ls -ai -1

from the output we can derive a table like the following one (I picked up only a few entries, just to let you get the gist of things):

Name | Inode
. | 128
.. | 128
etc | 132
swapfile | 308219

Dot (.) and double dot (..) are placeholders:

dot (.) refers to the directory itself
double dot (..) refers to the parent directory object

Since we are listing the contents of the root directory, . and .. refer to the same inode number.

The other entries are the actual names assigned to the objects backed by the respective inode numbers.

Let's repeat the exercise, this time with “etc” directory entry. Let's change to that directory first:

cd etc

and now, let's enumerate the contents:

ls -ai -1

the derived table, considering only a few objects is:

Name | Inode
. | 132
.. | 128
kernel | 8610758
bashrc | 157687

As expected, the inode of this directory is 132, and the inode of its parent is 128. Note how the name of the directory itself ("etc") is not stored inside it: if you want to get it, you have to look up the parent inode (128), read the directory object and locate the entry that has inode 132, which is indeed "etc".
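That lookup can be sketched with the same utilities used so far. The example below works on a scratch directory (etc-demo is a made-up name) and assumes GNU stat and an awk implementation supporting -v:

```shell
# etc-demo is a made-up directory name; assumes GNU stat and awk.
workdir=$(mktemp -d)
mkdir "$workdir/etc-demo"

# 1. Inode number of the directory, read through its own "." entry
self=$(stat -c '%i' "$workdir/etc-demo/.")

# 2. List the parent through ".." and pick the entry with that inode number
name=$(ls -ai1 "$workdir/etc-demo/.." |
    awk -v ino="$self" '$1 == ino && $2 != "." && $2 != ".." { print $2 }')

echo "inode $self is known to its parent as: $name"
rm -rf "$workdir"
```

This is essentially how the name of the current directory can be worked out when only its inode is known: climb one level up and search for a matching inode number.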

This way of working (changing to a directory and operating on the current directory) is exactly the way a hierarchical filesystem is implemented as described in the earliest paper by Thompson and Ritchie. Modern implementations, which you are certainly already acquainted with, let us specify even a full path, such as /etc/bashrc, or a relative path, such as ../home/macarcano.

As Pedro pointed out in comments "The internal of open/write/close on directories should always be done idempotent and synchronous because if something goes wrong you can end losing everything beneath it. A reminder fot the old SCO that did this async with a potential for catastrofy".

Now let's create a new directory in the current one ("/etc") - we have to use "sudo" since we are in a directory that can be modified only by users with administrative privileges:

sudo mkdir foo
ls -ai -1 foo

the derived table for the foo directory object is:

Name | Inode
. | 9060508
.. | 132

Note how the foo directory object does not contain its own name, and how an additional entry named foo has been added to the /etc directory object ("| grep foo" filters the output so as to list only the lines that contain foo):

ls -ai -1 |grep foo

the derived table entry is:

Name | Inode
foo | 9060508

So inode 9060508 is referenced twice: by foo entry in “etc” directory object and by “.” entry in “foo” directory object.

Getting back to the "foo" directory object: since it is empty it contains just two entries (. and ..), and its link count is 2, matching the two references to its inode we have just seen:

stat /etc/foo |grep Links |sed 's/.*\(Links: [0-9]*\)/\1/'
Links: 2

Let's create /etc/foo/bar directory :

sudo mkdir /etc/foo/bar

and check the number of links again

stat /etc/foo |grep Links |sed 's/.*\(Links: [0-9]*\)/\1/'
Links: 3

As seen above, creating a subdirectory adds a new entry to the parent directory object and raises the parent's link count by one, because the child's .. entry points back to the parent. The newly created directory, being empty, has a link count of 2: its entry in the parent and its own . entry. Adding regular files, instead, does not change the link count of the directory that contains them.

So the rule is that the link count of a directory object (the number of directory entries that refer to its inode) is equal to one for its entry in the parent directory, plus one for its own . entry, plus the number of its child directories, if any.
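The rule is easy to check on a scratch directory. This sketch assumes GNU stat and a filesystem that keeps the traditional directory link count (some, btrfs for example, always report 1 for directories); the directory and file names are arbitrary:

```shell
# Assumes GNU stat; names are arbitrary. Some filesystems (e.g. btrfs)
# always report a link count of 1 for directories.
workdir=$(mktemp -d)
mkdir "$workdir/foo"
before=$(stat -c '%h' "$workdir/foo")    # entry in the parent + "."

mkdir "$workdir/foo/a" "$workdir/foo/b" "$workdir/foo/c"
touch "$workdir/foo/regular-file"        # a regular file adds no link
after=$(stat -c '%h' "$workdir/foo")     # plus one ".." per child directory

echo "links before=$before after=$after"
rm -rf "$workdir"
```

With three child directories and one regular file, the count is expected to go from 2 to 5.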

Hard links

A question that should arise is: since the primary key is the name of the directory entry, is it allowed to have more than one directory entry with the same inode number? The answer is yes: you are simply assigning different names to the same inode number (and so to a file or a directory object backed by that inode number). This is called hard linking.

Let's try it: create a hard link called /home/macarcano/bashrc to the /etc/bashrc file (sudo is required due to the permissions of the /etc/bashrc file):

sudo ln /etc/bashrc /home/macarcano/bashrc
stat /etc/bashrc |grep Links |sed 's/.*\(Links: [0-9]*\)/\1/'
Links: 2

Note how the links to /etc/bashrc have grown to 2. Let's check the inode number of both:

ls -i -1 /etc/bashrc /home/macarcano/bashrc
157687 /etc/bashrc
157687 /home/macarcano/bashrc

It is clear that both /etc/bashrc and /home/macarcano/bashrc point to the same inode (157687), that is, both directory objects "etc" and "macarcano" contain an entry named "bashrc" linked to inode 157687.
Another thing I want to remark is that POSIX permissions are part of the inode: this means that changing the permissions of /home/macarcano/bashrc means changing the permissions of inode 157687:

ls -l /etc/bashrc /home/macarcano/bashrc 
-rw-r--r--. 2 root root 3001 Sep 10  2018 /etc/bashrc
-rw-r--r--. 2 root root 3001 Sep 10  2018 /home/macarcano/bashrc
sudo chmod 666 /home/macarcano/bashrc
ls -l /etc/bashrc /home/macarcano/bashrc 
-rw-rw-rw-. 2 root root 3001 Sep 10  2018 /etc/bashrc
-rw-rw-rw-. 2 root root 3001 Sep 10  2018 /home/macarcano/bashrc

Note how both /etc/bashrc and /home/macarcano/bashrc changed permissions. Let's restore the permissions as they were before:

sudo chmod 644 /etc/bashrc

Now let's remove /home/macarcano/bashrc and check what happens to /etc/bashrc:

sudo rm /home/macarcano/bashrc
ls -i -1 /etc/bashrc
157687 /etc/bashrc
stat /etc/bashrc |grep Links |sed 's/.*\(Links: [0-9]*\)/\1/'
Links: 1

Inode 157687 is still there, along with its contents. If we removed /etc/bashrc too (don't do it), the last directory entry linked to the inode would disappear, Links would become 0 and the inode, if no file descriptor is still using it, would actually be deleted.
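If you prefer not to touch /etc, the whole experiment can be repeated on scratch files. This is a sketch assuming GNU stat; original and alias are made-up names:

```shell
# Assumes GNU stat; original and alias are made-up names.
workdir=$(mktemp -d)
echo "payload" > "$workdir/original"
ln "$workdir/original" "$workdir/alias"

ino_original=$(stat -c '%i' "$workdir/original")
ino_alias=$(stat -c '%i' "$workdir/alias")
links_two=$(stat -c '%h' "$workdir/original")

# Permissions live in the inode: changing one name changes both
chmod 600 "$workdir/alias"
mode_original=$(stat -c '%a' "$workdir/original")

# Removing one name only drops one link: the data survives
rm "$workdir/original"
links_one=$(stat -c '%h' "$workdir/alias")
content=$(cat "$workdir/alias")

echo "inodes: $ino_original $ino_alias, links: $links_two -> $links_one"
echo "mode seen through the other name: $mode_original, content: $content"
rm -rf "$workdir"
```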

As Pedro pointed out in comments "Hard links could have been patented by Thompson and Ritchie".

The next question that arises is certainly: can we create a hard link to a directory the same way we did for the file?

Let's try it:

ln /etc/kernel /etc/krnl
ln: /etc/kernel: hard link not allowed for directory

OK, so we cannot.

As Pedro pointed out in comments "Hard link on directories would cause a potential for a never ending loop so
the kernel does not allow it".

Let's install strace to check the involved syscalls and make some guesses:

sudo dnf install -y strace

let's run the same command wrapped by strace, tracing the stat family of syscalls:

strace -e trace=stat,lstat ln /etc/kernel /etc/krnl
stat("/etc/kernel", {st_mode=S_IFDIR|0755, st_size=23, ...}) = 0
ln: /etc/kernel: hard link not allowed for directory
+++ exited with 1 +++

This highlights that the stat syscall reports the type of the inode (S_IFDIR here), which confirms what I told you before: there are several types of inodes.
So what is a hard link? It is a convenient way to add to a directory object an alias for an inode that points to a file of any kind except a directory; the data survives as long as there is at least one directory entry linked to the inode.

Now it should be clear that a hard link can only be created within the same filesystem: inode numbers are guaranteed to be unique only inside the same filesystem.

Soft links

But what if we need to create an alias of a directory, or between files on different filesystems? The answer is simple: we can create a soft link:

ln -s /etc/foo /tmp/bar
ls -li /etc |grep foo | awk '{print $1}'
9060508
ls -li /tmp |grep bar | awk '{print $1}'
207123

Note how this time we have two different directory entries, each pointing to its own inode: /etc/foo is linked to inode 9060508, whereas /tmp/bar is linked to 207123. Inode 207123 is a new inode that stores the path to reach /etc/foo. This means that if we delete /etc/foo, /tmp/bar survives as an object, but everything else is lost: /etc/foo is gone with its data and all that remains is a broken pointer:

sudo rm -rf /etc/foo
ls -li /etc |grep foo
ls -li /tmp |grep bar 
  207123 lrwxrwxrwx. 1 root    root     8 Mar 22 15:01 bar -> /etc/foo

Obviously, everything said for soft links to a directory applies the same way to soft links to files.

As you probably already guessed, you can create a symlink even to objects that do not belong to the same filesystem.
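The same behaviour can be reproduced without root on scratch objects. A minimal sketch, assuming GNU stat (which, without -L, describes the link itself rather than its target); target and link are made-up names:

```shell
# Assumes GNU stat; target and link are made-up names.
workdir=$(mktemp -d)
mkdir "$workdir/target"
ln -s "$workdir/target" "$workdir/link"

ino_target=$(stat -c '%i' "$workdir/target")
ino_link=$(stat -c '%i' "$workdir/link")    # inode of the link itself
stored=$(readlink "$workdir/link")          # the path stored in the link

# Remove the target: the link survives, but it now points nowhere
rmdir "$workdir/target"
if [ -L "$workdir/link" ] && [ ! -e "$workdir/link" ]; then
    dangling=yes
else
    dangling=no
fi

echo "target inode=$ino_target link inode=$ino_link"
echo "stored path=$stored dangling=$dangling"
rm -rf "$workdir"
```

The -L test looks at the link itself, whereas -e follows it: a link that passes the first and fails the second is a broken one.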

VFS - Virtual FileSystem

The Virtual FileSystem (VFS) is an abstraction layer between application programs and filesystem implementations, aimed at supporting many different kinds of filesystems (disk-based, such as ext3 and xfs; network, such as NFS; and special filesystems, such as "proc").
It introduces a common file model, specifically geared toward Unix filesystems, to represent all supported filesystems. Non-UNIX filesystems must map their own concepts onto the common file model: for example, FAT filesystems do not have inodes.
As each filesystem gets initialized (it can either be statically compiled into the kernel or loaded as a module), it registers itself with the VFS abstraction layer.
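On Linux, the list of filesystem types currently registered with the VFS can be read from /proc/filesystems. A short sketch, assuming /proc is mounted:

```shell
# Assumes Linux with /proc mounted.
registered=$(cat /proc/filesystems)
echo "$registered"

# Entries flagged "nodev" are not backed by a block device
nodev_count=$(echo "$registered" | grep -c '^nodev')
echo "filesystem types without a backing block device: $nodev_count"
```

Loading a filesystem module adds a line to this list, and "proc" itself is always expected to be there, flagged as nodev.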

From a bird's-eye view, the main components of the VFS common file model are:

file - information about an open file: it is the logical component file descriptors are bound to
dentry - information about a directory entry
inode - the metadata of a filesystem object
superblock - a component so important that it is stored in multiple copies. Besides containing a pointer to the first VFS inode on the filesystem (on EXT filesystems this is inode 2, since 0 and 1 are reserved), each VFS superblock also contains information and pointers to routines that perform particular functions.

For example, considering a mounted EXT2 filesystem, each superblock contains:

a pointer to the EXT2 specific inode reading routine;
pointers to system routines that are called when the system needs to access directories and files: these routines traverse the VFS inodes in the system and store the results in the inode cache, so as to speed up subsequent accesses to the same inodes.

Whenever a process opens a file, it gets a file descriptor that is bound to the file component: the information comes from the inode linked to the directory entry (dentry) relevant to that particular file. Since access to this kind of information can happen quite often, VFS maintains a few caches:

Directory Cache

it is used to speed up access to commonly used directories to quickly fetch the requested entries.

Inode Cache

it speeds up access to all of the mounted filesystems, avoiding access to the physical device. Each time a cached inode is accessed, the VFS increments its access count, in order to easily identify the most frequently accessed cached objects and preserve them, discarding the rarely accessed ones instead to make room for caching other inodes.

Buffer Cache

Linux maintains a cache of block buffers: any block buffer that has been used to read data from a block device or to write data to it goes into this cache, which is shared between all of the physical block devices. Over time, if a buffer is frequently accessed it remains in the cache; otherwise it may be removed to make room for more deserving data. Currently supported buffer sizes are 512, 1024, 2048, 4096 and 8192 bytes. Buffers are kept in sync with the underlying storage by kernel flusher threads (pdflush on 2.6 kernels before 2.6.32, bdflush on very old kernels).
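A rough idea of how much these caches currently hold can be obtained from /proc. The sketch below assumes Linux; the two files under /proc/sys/fs may be missing in restricted environments, hence the fallbacks:

```shell
# Assumes Linux with /proc mounted.
# Memory used by block buffers and by the page cache
grep -E '^(Buffers|Cached):' /proc/meminfo
cache_lines=$(grep -cE '^(Buffers|Cached):' /proc/meminfo)

# Directory cache: the first two fields are total and unused dentries
cat /proc/sys/fs/dentry-state 2>/dev/null || echo "dentry-state not available"

# Inode cache: allocated inodes and free inodes
cat /proc/sys/fs/inode-nr 2>/dev/null || echo "inode-nr not available"
```

Running it before and after a command that walks a large directory tree (a find over /usr, for example) typically shows the dentry and inode counters growing.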

Let's have a deeper look at the superblock.

Superblock

Roughly put, the superblock can be seen as "the metadata of the whole filesystem". Its format is tightly bound to the filesystem type, but usually it contains information such as:

filesystem type
block size
number of the inode of the file containing the root directory
the number of free blocks
the location of the root directory

and so on. As you can easily guess, all the information required to mount the filesystem is fetched from the superblock, so if it is damaged, the filesystem cannot be mounted anymore.

For this reason, almost every filesystem implementation keeps multiple copies of the superblock.

For example:

extended filesystem (ext2, ext3, ext4): initially a copy of the superblock was put at the beginning of every block group. Then, to reduce wasted space, copies were kept only in block groups 0 and 1 and in those whose number is a power of 3, 5 or 7 (sparse_super feature). With the more recent sparse_super2 feature, when enabled, backups are located only at the beginning of block group 1 and in the last block group. The copy in block group 0 is called the primary superblock.
XFS: v5 uses the first 512 bytes at the beginning of each Allocation Group.

As Pedro pointed out in comments "Ext file system is a heritage from the BSD Fast File System.
Superblock label and block groups (aligned to cylinders groups) are original
from the BSD FFS".
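Part of the filesystem-wide information listed earlier in this section is available to unprivileged users through the statfs interface. This is a sketch assuming GNU stat, whose -f option reports on the filesystem containing the given path instead of on the file:

```shell
# Assumes GNU coreutils stat; / is just an example mount point.
stat -f -c 'type=%T block_size=%S total_blocks=%b free_blocks=%f' /
stat -f -c 'total_inodes=%c free_inodes=%d' /

fs_type=$(stat -f -c '%T' /)
block_size=$(stat -f -c '%S' /)
echo "the root filesystem is of type $fs_type with $block_size byte blocks"
```

On filesystems that allocate inodes dynamically the inode totals may be reported as 0 or change over time; df -i shows the same inode figures for all the mounted filesystems at once.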

Recovering from a corrupted superblock

The superblock is read at mount time: if it's damaged, and so invalid, you'll get a "corrupted superblock" message.

If this happens with a filesystem of the EXT family, you can get information on the superblock backups using the dumpe2fs utility, as in the following example:

sudo dumpe2fs /dev/sda1 | grep -i superblock

This looks for the backups of the superblock of the filesystem stored in the /dev/sda1 disk partition.

The output looks as follows:

dumpe2fs 1.44.3 (10-July-2018)
  Primary superblock at 1, Group descriptors at 2-2
  Backup superblock at 8193, Group descriptors at 8194-8194

from the output we see that there's a backup of the superblock at 8193: this means that we can use it to recover the filesystem if necessary; the command to use would be:

sudo e2fsck -b 8193 /dev/sda1

As Pedro pointed out in comments "On critical file systems keep a backup info where are the the backup superblocks
dumpe2fs /dev/sda1 | grep -i superblock > somewhere.txt".
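You can practise on an image file instead of a real partition, without root privileges. The sketch below assumes the e2fsprogs utilities (mke2fs and dumpe2fs) are installed and reachable through PATH, and simply skips the exercise otherwise; the image name and its size are arbitrary:

```shell
# Assumes e2fsprogs in PATH; ext.img and its size are arbitrary choices.
if command -v mke2fs >/dev/null 2>&1 && command -v dumpe2fs >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    truncate -s 32M "$workdir/ext.img"

    # 1 KiB blocks give 8192-block groups, so a 32 MiB image has several
    mke2fs -q -F -t ext4 -b 1024 "$workdir/ext.img"

    # List the primary superblock and its backups
    backups=$(dumpe2fs "$workdir/ext.img" 2>/dev/null | grep -i 'superblock at')
    echo "$backups"
    rm -rf "$workdir"
else
    backups=""
    echo "e2fsprogs not found, skipping"
fi
```

With these parameters the first backup is expected at block 8193, as in the output shown above; the image can also be used to rehearse e2fsck -b before doing it on a real filesystem.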

Life with XFS is even easier: xfs_repair automatically checks the primary and backup superblocks and restores a corrupted primary superblock if necessary.

However, for the sake of completeness, let's have a look at the superblock of an XFS filesystem too: as previously mentioned, when using XFS, copies of the superblock are stored at the beginning of each Allocation Group.

In the following example we launch the xfs_db utility to inspect the XFS filesystem stored in the /dev/sda1 partition. Type the following statement:

xfs_db /dev/sda1

Now type sb to focus on the primary superblock, or sb followed by the number of an Allocation Group to use the copy of the superblock stored there.

For example, to select the superblock in Allocation Group 1, type:

xfs_db> sb 1

you can now print superblock information as follows:

xfs_db> p

the following box contains a snippet with some sample output:

magicnum = 0x58465342
blocksize = 4096
dblocks = 5120
rblocks = 0
rextents = 0
uuid = 5c836a30-db4e-41d4-a704-466d37857c82

type q and press Enter to exit the xfs_db utility. The full output contains a lot of information, including the root inode number:

rootino = 128

this value matches the inode number that the ls command reports for the mount point of the given filesystem: for example, if /dev/sda1 is mounted on /

ls -di /

produces the following output:

128 /
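Two of the superblock fields in the sample xfs_db output above are easy to sanity-check by hand. As a small sketch (decode_magic and fs_size_bytes are helper names invented here, they are not XFS tools): the magic number is simply four ASCII bytes, and blocksize multiplied by dblocks gives the size of the data section in bytes.

```shell
# Hypothetical helper: turn a hex magic number such as 0x58465342 into ASCII.
decode_magic() {
    hex=${1#0x}
    out=""
    while [ -n "$hex" ]; do
        byte=${hex%"${hex#??}"}     # first two hex digits
        hex=${hex#??}               # drop them from the remainder
        # Convert the byte to octal and let printf emit the character.
        out="$out$(printf "\\$(printf '%03o' "$((0x$byte))")")"
    done
    printf '%s\n' "$out"
}

# Hypothetical helper: data section size in bytes = blocksize * dblocks.
fs_size_bytes() {
    echo $(($1 * $2))
}

decode_magic 0x58465342     # prints XFSB
fs_size_bytes 4096 5120     # prints 20971520, that is 20 MiB
```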

Footnotes

Here ends our guided tour of POSIX filesystems: I hope you enjoyed it, and that you agree it is nicer to have a little understanding of how they work rather than simply knowing how to operate them.


10 thoughts on “POSIX compliant filesystems”

Michael Paoli says:

June 4, 2021 at 4:55 pm

–sparse, etc. – looks like you probably input two en dashes (-) and automangle unhelpfully converted them to a single em dash.

- Marco Antonio Carcano says:
  
  June 4, 2021 at 8:46 pm
  
  Thank you for reporting it, Michael: I missed it. However, it seems even trickier than this: I had a look at the page source and both of the dashes are there. I suspect there’s something weird within the CSS, although I’m not a CSS expert. I’ll try to investigate.
  
- therainisme says:
  
  April 4, 2024 at 5:31 pm
  
  This is the clearest blog I’ve ever seen! Thanks!
  
  - Marco Antonio Carcano says:
    
    April 7, 2024 at 7:57 am
    
    Hi, glad to know you like the posts – the mission of this blog is providing professional grade contents in a way that must be clear for everybody. I really need comments: it’s the only way to know how good or bad and interesting or pointless my posts are. Cheers.
    
Pedro says:

June 10, 2021 at 6:54 pm

Nice articles.
Good.

Orphaned inodes are inodes that are still marked as used and bound to data blocks on the disk, but with no more directory entries linked to them.
On the EXT filesystem the fsck utility recovers them and puts them into the lost+found directory. But there are situations when you can neither shut down the system nor unmount the filesystem to run fsck and perform the recovery.

This is classic Unix behavior in action.
You can keep writing to a file descriptor after the unlink(2).
The inode is freed but as the writing continues more data blocks get assigned.
If this is a kind of never ending writing process unless you have limits you can even exhaust the file system data blocks.
But no problem: once you close the file descriptor or end the process, all data blocks become free again.

Take a look at the old classic book by Maurice Bach, “The Design of the UNIX Operating System”.
It describes fsck and this Unix behavior.

By the way, there are cases where you are unable to terminate a process: even SIGKILL does nothing.
This means it is in an UNINTERRUPTIBLE SLEEP. It happens sometimes, because of bad C programming or because
the coder did not want any break.
Even a system shutdown takes more time to deal with this.
The proc file system first appeared on Unix System V, before Linux.
Linux took it on.

Unnamed pipes (ls | more) allocate an inode in their short lifespan.

On the root of any filesystem . and .. are of course on the same inode.
Hard links on directories would create the potential for a never-ending loop, so
the kernel does not allow them. Classic Unix again.
Hard links could have been patented by Thompson and Ritchie.

By the way, the internals of open/write/close on directories should always be idempotent and synchronous, because if something goes wrong you can end up losing everything beneath it. A reminder of the old SCO, which did this asynchronously, with a potential for catastrophe.

Ext file system is a heritage from the BSD Fast File System.
Superblock label and block groups (aligned to cylinders groups) are original
from the BSD FFS.

On critical file systems keep a backup info where are the backup superblocks
dumpe2fs /dev/sda1 | grep -i superblock > somewhere.txt

- Marco Antonio Carcano says:
  
  June 10, 2021 at 8:31 pm
  
  Thanks a lot for all the notes, Pedro: they’re the cherry on the top. I’m really pleased to see contribs like this.
  
  - Pedro says:
    
    June 11, 2021 at 6:50 pm
    
    Hi Marco
    I should have written a bit more.
    “If this is a kind of never ending writing process unless you have limits you can even exhaust the file system data blocks. … with no easy clue as to where the data blocks are being used. You see block after block vanishing into nowhere???”
    Once you close the file descriptor or end the process they go back to free blocks and suddenly all that “lost” space
    “magically reappears”.
    
Daisuke says:

February 26, 2022 at 1:29 am

I found that `cp sparse/disk2.img backup/disk4.img` produces a slightly different (and interesting) size from your example: 5.7M on Ubuntu 20.04, 4.6M on Ubuntu 18.04, not 16M.
`cp` assumes `--sparse=auto`, whose algorithm likely depends on the implementation.

Anyway, thank you for the great post. It is invaluable to me.
