Don’t be tempted to skip this post: you would miss something valuable. Of course most of us know how to operate a filesystem, but the underlying details of POSIX filesystems are not broadly known. In this post I describe them quite accurately, trying to keep the level intriguing while avoiding being too theoretical. Having such expertise is certainly one of the things that makes the difference between a technician and a skilled professional. In addition, this skill may really save you when facing the weird issues that sometimes arise.
Elements of a POSIX filesystem
Directories, inodes and data blocks are the basic elements of a POSIX-compliant filesystem: inodes are objects you should already be acquainted with, so we begin our tour with them.
Inodes
An inode is a data structure that stores the metadata of an object. There are a few different types of inodes (file, directory, socket, special device); they provide information such as:
- a timestamp of when the inode was last modified (the ctime)
- a timestamp of when the file's contents were last modified (the mtime)
- a timestamp of when the file was last accessed (the atime), for example by read() or exec()
- the numeric user id (UID) of the user that owns the backed data
- the numeric group id (GID) of the group that owns the backed data
- the POSIX mode bits (often called permissions or permission bits)
- ...
The inode's metadata also provides the information needed to locate the initial data blocks belonging to the file. Note that each inode is an entry of the inode table: since the size of each entry is fixed, the inode table is pre-allocated at format time. This means that, once a filesystem has been formatted, the inode table can store only a given number of entries, that is of inodes: this is the upper limit on the number of objects you can create on that POSIX filesystem, whether they are files or directories.
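As a quick check of this limit, we can use df with the -i option, which reports the total, used and free inode counts of a mounted filesystem (the figures obviously depend on how your filesystem was formatted):
df -i /
once a filesystem runs out of free inodes you cannot create any new object on it, even if df -h still reports plenty of free space.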
Let's see this in action: by using the stat command we can get information about the inode linked to the directory entry called "bashrc" of the "/etc" directory:
cd /etc
stat bashrc
the outcome is:
File: bashrc
Size: 3001 Blocks: 8 IO Block: 4096 regular file
Device: 801h/2049d Inode: 157687 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Context: system_u:object_r:etc_t:s0
Access: 2020-03-21 13:14:18.597717832 +0000
Modify: 2018-09-10 11:51:03.000000000 +0000
Change: 2019-10-25 18:37:43.962000000 +0000
Birth: -
the previous output shows that it is the inode of a regular file with inode number 157687.
File
It is a set of data blocks used to permanently store contents, such as documents, images and so on. Note that under the hood, a file does not have a name: it's only a logical object made of a set of blocks that can be either directly referenced from the inode or indirectly referenced from another data block. The name of the file is instead stored inside a directory as a directory entry.
File size and size on disk
We can now start talking about the size of files. They actually have two sizes:
- the length of the data in the file (let's call it simply “size”)
- the actual size used on disk
Most of the time they differ: let's look for example at the sizes of the /etc/passwd file (-l shows the size, -s the size on disk, -h prints them in human-readable form):
ls -lsh passwd
4.0K -rw-r--r--. 1 root root 1.8K Mar 11 20:39 passwd
the actual length of the data in the file is 1.8 KB, that is how far you have to seek from the Beginning Of File (BOF) to get to the End Of File (EOF). It is right to think of it as the actual data size, whereas 4KB is the actual size used on disk.
This latter information is the same reported by the du utility, by the way:
du -sh passwd
4.0K passwd
file size and actual size on disk differ because the block is the minimum unit of disk space allocation, and in this example the underlying filesystem has been created with a 4KB block size - on XFS filesystems we can verify this as follows:
xfs_info / | grep "^data[ ]*=[ ]*" | sed 's/.*\(bsize=[0-9]*\).*/\1/'
bsize=4096
So in this example the file owns a whole block of 4KB of data, but the length of the data from BOF to EOF is only 1.8KB.
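On filesystems of the EXT family the block size can be checked in a similar way; a quick sketch, assuming an EXT filesystem lives in the /dev/sda1 partition (adjust the device to your layout):
sudo tune2fs -l /dev/sda1 | grep "Block size"
tune2fs -l dumps the superblock fields, and grep keeps only the line reporting the block size.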
Sparse files
If the underlying filesystem supports them (as most Unix filesystems do), sparse files are the exception: they differ from "regular" files in the way the underlying disk blocks are allocated - a sparse file declares a given length from BOF to EOF (so it declares a data size) but keeps the block references set to null.
We can see this in action by using the dd command line utility as follows:
dd if=/dev/zero of=/tmp/disk.img bs=1 count=0 seek=1G
it creates an empty file of 1GB (seek=1G means seeking 1GB past the BOF, so the file is declared to be 1GB long), but without writing any block (count=0): no block is allocated anywhere along that length, so all the block references are null.
If the underlying filesystem has sparse files support, it does not allocate the backing disk blocks, leaving the file at 0 size on disk:
ls -lsh /tmp/disk.img
0 -rw-r--r--. 1 macarcano macarcano 1.0G Mar 21 14:12 /tmp/disk.img
du reports the same information:
du -sh /tmp/disk.img
0 /tmp/disk.img
disk blocks are actually allocated only when needed, that is when something writes into the file.
Now let's remove the sparse file and go on:
rm -f /tmp/disk.img
Let's create a smaller sparse file to play with - this time we create it using truncate utility:
cd /tmp
mkdir sparse
truncate -s 16M sparse/disk.img
now let's format it using a data block size smaller than the default one and then mount it:
mkfs.xfs -b size=1024 sparse/disk.img
sudo mount -o loop sparse/disk.img /mnt
sudo chmod 777 /mnt
now let's store a short string (11 characters plus the newline added by echo, 12 bytes in total) into a file:
echo "Hello world" > /mnt/foo
let's check the sizes:
ls -lsh /mnt/foo
1.0K -rw-rw-r--. 1 macarcano macarcano 12 Mar 21 14:28 /mnt/foo
everything is as expected: the size is 12 bytes, whereas the size on disk is 1K, which matches the block size we chose at format time - indeed:
xfs_info /mnt | grep "^data[ ]*=[ ]*" | sed 's/.*\(bsize=[0-9]*\).*/\1/'
bsize=1024
Now create a 2M file of random data inside it and unmount the filesystem
dd if=/dev/urandom of=/mnt/bar.dat bs=1M count=2
sudo umount /mnt
Note how ls displays both the size on disk (5.7M) and declared size (16M):
ls -lsh sparse/disk.img
5.7M -rw-rw-r--. 1 macarcano macarcano 16M Mar 21 14:30 sparse/disk.img
Sparse files are generally used to store loop filesystems, and many virtualization solutions rely on them to create thin-provisioned VM disks: their usage should be carefully evaluated anyway, since they obviously tend to fragment.
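If you are curious about where the holes of a sparse file actually are, the filefrag utility (shipped with e2fsprogs, but it works on any filesystem that supports the FIEMAP ioctl, XFS included) prints the map of the allocated extents - a sketch using the sparse/disk.img file created above:
filefrag -v sparse/disk.img
only the ranges that have actually been written show up as extents; the gaps between them are the holes.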
You can move a sparse file from one location to another within the same filesystem using mv: since mv operates within the same filesystem, it does not touch the inode, and so it does not alter the file size - it simply removes the directory entry from the origin directory and adds a new entry to the destination directory.
ls -i sparse
18636774 disk.img
mv sparse/disk.img sparse/disk2.img
ls -i sparse
18636774 disk2.img
The copy (cp) command supports sparse files, but you should supply the --sparse option, for example:
mkdir backup
cp --sparse=always sparse/disk2.img backup/disk3.img
ls -lsh backup/disk3.img
4.1M -rw-rw-r--. 1 macarcano macarcano 16M Mar 21 15:20 backup/disk3.img
note how the actual size on disk changed (4.1M, it was 5.7M for the source file), and is still lower than the declared size (16M).
Since we are human beings, we may copy a sparse file like a regular one by mistake, omitting the option that avoids allocating backing blocks for the unused ranges:
cp sparse/disk2.img backup/disk4.img
ls -lsh backup/disk4.img
16M -rw-rw-r--. 1 macarcano macarcano 16M Mar 21 15:21 backup/disk4.img
we can easily recover from an error like this by running the fallocate command utility on the destination file with the --dig-holes option:
fallocate --dig-holes backup/disk4.img
let's check the size again: it should be roughly the size of the original sparse file
ls -lsh backup/disk4.img
4.2M -rw-rw-r--. 1 macarcano macarcano 16M Mar 21 15:20 backup/disk4.img
When dealing with a lot of files or with remote backups, you can keep things in sync by using rsync with the -S (sparse) option:
rsync -S backup/disk4.img backupuser@remote:/home/macarcano
Let's check the file sizes on the remote server:
ls -lash /home/macarcano/disk4.img
8.0M -rw-rw-r--. 1 backupuser backupuser 16M Mar 21 15:23 disk4.img
as you see, the archived file is still a sparse file, although it consumes more space: 8.0M, whereas the original one was only 4.2M.
Orphaned inodes
Orphaned inodes are inodes that are still marked as used and bound to data blocks on the disk, but with no more directory entries linked to them.
On EXT filesystems the fsck utility recovers them and puts them into the lost+found directory. But there are situations in which you can neither shut down the system nor unmount the filesystem to run fsck and perform the recovery.
Let's see a real life example that happened years ago: on a system that was running out of disk space, somebody archived (copied and deleted) a log file while a service was still writing to it. The outcome was that the inode of the file became orphaned: still there (consuming disk space), still receiving log messages and, even worse, still growing. Reloading the service would have meant losing all the messages logged after the file had been copied - and this was unacceptable. Eventually I got the idea of cat-ing the still-open file descriptor into a new archive file, truncating the file through that same file descriptor and then reloading the service. It worked.
As an example, let's say there is a running nginx service instance that, among other files, is logging into /var/log/nginx/error.log, and that this log file has been deleted: we can use the lsof utility and search for open files marked as deleted:
lsof | grep nginx | grep "(deleted)"
the output should look like as follows:
nginx 1121 root 2w REG 8,1 0 25527982 /var/log/nginx/error.log (deleted)
Please note the PID (1121) and the file descriptor (2, the "2w" in the fourth column - 8,1 is instead the major,minor pair of the device). Now, by exploiting the /proc filesystem, we can cat the still-open file descriptor into a compressed archive, truncate the file through that same file descriptor and reload the service.
In the above example that amounts to:
cat /proc/1121/fd/2 | gzip > /tmp/archive.gz
truncate -s0 /proc/1121/fd/2
systemctl reload nginx
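As a side note, when you only suspect that deleted-but-still-open files are eating disk space and don't know which process is responsible, lsof can list every open file whose link count has dropped below 1:
lsof +L1
the +L1 option means "show open files having fewer than 1 link", which is exactly the situation of an orphaned inode kept alive by a file descriptor.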
Special File types available in POSIX filesystems
Inodes can also be used to back objects that are not real files or directories, giving the user the feeling that those objects are files: I very often hear the famous sentence "on UNIX everything is a file", but this is not quite the right way of saying it: the truth is that "on UNIX almost everything is available as a file". We are about to discover why.
Device Files
These are special files that provide direct access to devices, such as disks, the keyboard, the mouse and so on. There are two kinds of device files:
- block device files, which provide buffered, block-oriented access to devices such as disks
- character device files, which provide unbuffered, stream-oriented access to devices such as terminals or the random number generator
Device files are created using the mknod utility: the syntax is as follows:
mknod device-name device-type major-number minor-number
- major number identifies the device driver used to access the device
- minor number is specific to the device driver implementation
for example, to create a block device file:
sudo mknod /dev/loop1 b 7 1
to create a character device file:
sudo mknod /dev/another-random c 1 8
Note that, historically, both numbers were stored in a single byte each, so the traditional range is 0-255; modern Linux kernels use a wider encoding (12 bits for the major and 20 for the minor number). The list of major and minor numbers used by the Linux kernel is available at https://www.kernel.org/doc/Documentation/admin-guide/devices.txt
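You can check the major and minor numbers assigned to the device files already present on your system with a plain ls -l in /dev: for block and character device files the two numbers are printed where the file size would normally be, the major first and the minor second. For example (assuming your first disk is exposed as /dev/sda):
ls -l /dev/sda /dev/null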
Named pipes
Pipes are a one-way channel used by processes to exchange data; they are part of the Inter Process Communication (IPC) subsystem. Unlike traditional (anonymous) pipes, which do not have a name, named pipes are special files a process can attach to in order to send streams to and read streams from: they are a convenient way to let a process stream data to another one.
Named pipes can be created either with the mknod or with the mkfifo command line utility.
For example, the statement:
mkfifo -m 0666 foopipe
creates a file called foopipe into the current directory. We can then see the outcome:
ll foopipe
the output is as follows:
prw-r--r-- 1 macarcano macarcano 0 Sep 15 12:21 foopipe
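To see the pipe at work we can attach a reader and a writer to it from two different terminals. On the first terminal attach a reader - the command blocks until a writer shows up:
cat foopipe
on a second terminal write a line into the pipe:
echo "hello through the pipe" > foopipe
the string immediately appears on the first terminal and both commands return: nothing is stored on disk, the data only transits through the kernel.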
UNIX Domain Socket
It is an IPC facility that provides a data communication endpoint for exchanging data between processes. There are three kinds of UNIX domain sockets:
- unnamed - these sockets are created to exchange data, but do not have a name
- abstract - these sockets have a name that must be unique inside an abstract namespace; utilities such as lsof and fuser display this kind of socket with a name prefixed by the at sign (@)
- pathname - these sockets are connected to an inode and so are exposed through the filesystem
Let's create a pathname UNIX domain socket to play with - we need to install netcat first:
sudo dnf install -y netcat
let's run the first netcat instance, that acts like a server:
sudo nc -l -N -U /var/run/foo.socket
the above statement launches netcat in listening mode on /var/run/foo.socket. Open another terminal and spawn a client connected to the same socket:
sudo nc -N -U /var/run/foo.socket
on the terminal running the client, type "Hello, this is the client" and hit enter: you'll get it echoed on the terminal running the server.
Hello, this is the client
Now, on the terminal running the server, type "Hi dear, nice to meet you" and hit enter: it will be echoed on the client terminal.
Hi dear, nice to meet you
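While the two netcat processes are still connected, we can also look at the socket from the kernel's point of view: open a third terminal and use ss with the -xa options, which list all UNIX domain sockets:
ss -xa | grep foo.socket
both the listening endpoint and the established connection should show up.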
On the terminal running the server, press CTRL+C to terminate it: the client will be immediately disconnected.
Note that the UNIX domain socket file opened by netcat does not get removed:
ls -al /var/run/foo.socket
srwxr-xr-x. 1 root root 0 May 3 19:07 /var/run/foo.socket
It should be explicitly unlinked.
sudo rm /var/run/foo.socket
Directory
A directory is a special object whose contents bind names to inode numbers (each filesystem implementation has its own data model for it, ranging from a simple list of records to a B+Tree).
What is common among all filesystem implementations is that:
- each entry has at least a logical name, along with the number of the inode that contains the metadata backing the entry
- the name can be any string made of any characters, with the exception of the "/" character and of the "null" character (a zero byte) - the length of the string is of course limited too, but the actual limit depends on the filesystem implementation
- the name is the primary key - that is, you cannot have two entries with the same name within the same directory, and so no two objects with the same path
The fact that the name is the primary key is easy to verify: let's try to create a directory with the same name as an already existing file:
cd /etc
mkdir bashrc
the outcome is:
mkdir: cannot create directory 'bashrc': File exists
it obviously fails - the directory object named "etc", a direct child of the root directory (/), already has an entry called "bashrc".
We can use some commonly used utilities to infer the structure of a directory under the hood.
Let's change to the root directory (/)
cd /
We can use the -i option to see the inode number of each entry of the directory, along with the -a option, which also shows hidden entries:
ls -ai -1
from the output we can derive a table like the following one (I picked up only a few entries, just to let you get the gist of things):
Inode      Name
128        .
128        ..
132        etc
Dot (.) and double dot (..) are placeholders:
- dot (.) refers to the directory itself
- double dot (..) refers to the parent directory object
Since we are listing the contents of the root directory, . and .. refer to the same inode number.
The other entries are the actual names assigned to the objects backed by the corresponding inode numbers.
Let's repeat the exercise, this time with “etc” directory entry. Let's change to that directory first:
cd etc
and now, let's enumerate the contents:
ls -ai -1
the derived table, considering only a few objects is:
Inode      Name
132        .
128        ..
157687     bashrc
As expected, the inode of this directory is 132, and the inode of its parent is 128. Note how the name of the directory itself ("etc") is not stored inside it: to get it you have to look up the parent (inode 128), read that directory object and locate the entry whose inode is 132 - that entry's name is "etc".
Now let's create a new directory inside the current one ("/etc") - we have to use "sudo" since we are in a directory that can be modified only by users with administrative privileges:
sudo mkdir foo
ls -ai -1 foo
the derived table for the foo directory object is:
Inode      Name
9060508    .
132        ..
please note how the foo directory object does not contain its own name, and how an additional entry with the name foo has been added to the /etc directory object - "| grep foo" is used to filter the output so that only lines containing foo are listed:
ls -ai -1 |grep foo
the derived table entry is:
Inode      Name
9060508    foo
So inode 9060508 is referenced twice: by foo entry in “etc” directory object and by “.” entry in “foo” directory object.
Getting back to the "foo" directory object, since it is empty it has just two entries: . and ..
stat /etc/foo |grep Links |sed 's/.*\(Links: [0-9]*\)/\1/'
Links: 2
Let's create /etc/foo/bar directory :
sudo mkdir /etc/foo/bar
and check the number of links again
stat /etc/foo |grep Links |sed 's/.*\(Links: [0-9]*\)/\1/'
Links: 3
As seen before, creating a regular file only adds a new entry to the containing directory object and does not change its link count; creating a subdirectory, instead, also increments the parent's link count, because the new subdirectory contains a ".." entry pointing back to it. The newly created directory object itself, being empty, always starts with 2 links (. and ..).
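A handy rule of thumb follows from this: on traditional filesystems the link count of a directory is always 2 plus the number of its direct subdirectories (each child contributes one ".." entry pointing back to the parent), so you can count subdirectories without even listing them - for example, for /etc/foo as it is now:
stat -c %h /etc/foo
find /etc/foo -mindepth 1 -maxdepth 1 -type d | wc -l
the first command should report a number exceeding the second one by exactly 2 (3 and 1 in our example, since bar is the only subdirectory). Note that a few modern filesystems, btrfs for instance, do not maintain this convention.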
Hard links
A question that should arise is: since the primary key is the name of the directory entry, is it allowed to have more than one directory entry with the same inode number? The answer is yes - you are simply assigning different names to the same inode number (and so to the object backed by that inode). This is called hard linking.
Let's try it: create a hard link called /home/macarcano/bashrc to /etc/bashrc file (sudo is required due to permissions of /etc/bashrc file)
sudo ln /etc/bashrc /home/macarcano/bashrc
stat /etc/bashrc |grep Links |sed 's/.*\(Links: [0-9]*\)/\1/'
Links: 2
note how links to /etc/bashrc have grown to 2. Let's check the inode number of both:
ls -i -1 /etc/bashrc /home/macarcano/bashrc
157687 /etc/bashrc
157687 /home/macarcano/bashrc
it is straightforward that both /etc/bashrc and /home/macarcano/bashrc point to the same inode (157687), that is, both the "etc" and the "macarcano" directory objects contain an entry named "bashrc" linked to inode 157687.
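If you ever need to go the other way round - starting from an inode number and finding every directory entry linked to it - find can search by inode number; the -xdev option keeps the search within a single filesystem, which matters because inode numbers are only unique per filesystem:
sudo find / -xdev -inum 157687
in our example this would list both /etc/bashrc and /home/macarcano/bashrc.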
Another thing I want to remark is that POSIX permissions are part of the inode: this means that changing the permissions of /home/macarcano/bashrc means changing the permissions of inode 157687:
ls -l /etc/bashrc /home/macarcano/bashrc
-rw-r--r--. 2 root root 3001 Sep 10 2018 /etc/bashrc
-rw-r--r--. 2 root root 3001 Sep 10 2018 /home/macarcano/bashrc
sudo chmod 666 /home/macarcano/bashrc
ls -l /etc/bashrc /home/macarcano/bashrc
-rw-rw-rw-. 2 root root 3001 Sep 10 2018 /etc/bashrc
-rw-rw-rw-. 2 root root 3001 Sep 10 2018 /home/macarcano/bashrc
note how both /etc/bashrc and /home/macarcano/bashrc changed permissions – let's restore permission as they were before:
sudo chmod 644 /etc/bashrc
Now let's remove /home/macarcano/bashrc and check what happens to /etc/bashrc:
sudo rm /home/macarcano/bashrc
ls -i -1 /etc/bashrc
157687 /etc/bashrc
stat /etc/bashrc |grep Links |sed 's/.*\(Links: [0-9]*\)/\1/'
Links: 1
Inode 157687 is still there, along with its contents. If we also removed /etc/bashrc (don't do it), even the last directory entry linked to the inode in the "etc" directory object would disappear - Links would drop to 0 - and the inode, provided it is not kept open by any file descriptor, would actually be deleted.
The next question that arises is certainly: can we create a hard link to a directory the same way we did with a file?
Let's try it:
ln /etc/kernel /etc/krnl
ln: /etc/kernel: hard link not allowed for directory
OK, so we cannot: a hard link to a directory could create loops in the directory tree (with the risk of never-ending traversals), so the kernel simply does not allow it.
Let's install strace to check the involved syscall and make some guesses:
sudo dnf install -y strace
let's run the same command wrapped by strace:
strace -e stat ln /etc/kernel /etc/krnl
stat("/etc/kernel", {st_mode=S_IFDIR|0755, st_size=23, ...}) = 0
ln: /etc/kernel: hard link not allowed for directory
+++ exited with 1 +++
this highlights that ln first stats the source: the st_mode field (S_IFDIR) tells it that the inode backs a directory, so it refuses before even attempting the link() syscall - and this confirms what I told you before: there are several types of inodes.
So what is a hard link? It is a convenient way to add to a directory object an alias for an inode that backs any kind of object but a directory; the data survives as long as there is at least one directory entry linked to the inode.
Soft links
But what if we need to create an alias for a directory, or an alias between files on different filesystems? The answer is simple: we can create a soft link (also called a symbolic link):
ln -s /etc/foo /tmp/bar
ls -li /etc |grep foo | awk '{print $1}'
9060508
ls -li /tmp |grep bar | awk '{print $1}'
207123
please note how this time we have two different directory entries, each pointing to its own inode: /etc/foo is linked to inode 9060508 whereas /tmp/bar is linked to inode 207123. Inode 207123 is a new inode that simply stores the path to reach /etc/foo. This means that if we delete /etc/foo, /tmp/bar survives as an object, but everything else is lost: /etc/foo with its data is gone and all that remains is a broken pointer:
sudo rm -rf /etc/foo
ls -li /etc |grep foo
ls -li /tmp |grep bar
207123 lrwxrwxrwx. 1 root root 8 Mar 22 15:01 bar -> /etc/foo
Obviously, everything said for soft links to a directory applies the same way to soft links to files.
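By the way, the path stored inside the inode of a symbolic link can be read directly with the readlink utility - note how it happily returns the target even now that the link is broken:
readlink /tmp/bar
/etc/foo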
VFS - Virtual FileSystem
The Virtual FileSystem (VFS) is an abstraction layer between application programs and the filesystem implementations, aimed at providing support for many different kinds of filesystems: disk-based ones, such as ext3 or XFS, network ones, such as NFS, and special filesystems such as "proc".
It introduces a common file model, specifically geared toward Unix filesystems, used to represent all supported filesystems. Non-UNIX filesystems must map their own concepts onto the common file model - for example, FAT filesystems do not have inodes.
As each file system gets initialized (it can either be statically compiled or loaded as a module), it registers itself within the VFS abstraction layer.
From a bird's-eye view, the main components of the VFS common file model are:
- the superblock object, which represents a mounted filesystem
- the inode object, which represents a single filesystem object (file, directory, device and so on)
- the file object, which represents a file as it has been opened by a process
- the dentry object (directory entry), which binds a name to an inode
The superblock is such an important component that it is stored on disk in multiple copies. Besides containing a pointer to the root VFS inode of the filesystem (inode 2 on EXT filesystems, since inodes 0 and 1 are reserved), each VFS superblock also contains information and pointers to routines that perform particular functions.
For example, considering a mounted EXT2 filesystem, each superblock contains:
- a pointer to the EXT2 specific inode reading routine;
- pointers to system routines that are called when the system needs to access directories and files: these routines traverse the VFS inodes and store the results in the inode cache, so as to speed up subsequent accesses to the same inodes.
Whenever a process opens a file, it gets a file descriptor that is bound to a file object: the backing information comes from the inode linked to the directory entry (dentry) relevant to that particular file. Since this kind of information is accessed very often, the VFS maintains a few caches:
- the dentry cache (dcache), which keeps recently resolved directory entries in memory
- the inode cache, which keeps recently used VFS inodes in memory
- the page (buffer) cache, which keeps recently read or written data blocks in memory
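A rough way to peek at the dentry and inode caches on a running Linux box is looking at the kernel slab allocator statistics - a sketch; the exact slab names depend on the kernel version and on the filesystems in use:
sudo grep -E '^(dentry|inode_cache|xfs_inode|ext4_inode_cache)' /proc/slabinfo
each line reports, among other counters, how many cached objects of that kind are currently allocated.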
Let's have a deeper look at the superblock.
Superblock
Roughly put, the superblock can be seen as "the metadata of the whole filesystem". Its format is tightly bound to the filesystem type, but usually it contains information such as:
- filesystem type
- block size
- number of the inode of the file containing the root directory
- the number of free blocks
- the location of the root directory
and so on. As you can easily guess, all the information required to mount the filesystem is fetched from the superblock, so if it gets damaged the filesystem cannot be mounted anymore.
For this reason, almost every filesystem implementation usually keeps multiple copies of the superblock.
For example:
- extended filesystems (ext2, ext3, ext4): initially a copy of the superblock was kept at the beginning of every block group. Later, to reduce the wasted space, copies were kept only in block groups 0 and 1 and in the block groups whose number is a power of 3, 5 or 7 (the sparse_super feature). With the newer sparse_super2 feature, superblock backups are kept only at the beginning of block group #1 and in the last block group. The copy at the beginning of block group 0 is called the primary superblock.
- XFS: v5 uses the first 512 bytes at the beginning of each Allocation Group.
As a historical note, the superblock and the block groups (aligned to cylinder groups) are a heritage of the BSD Fast File System (FFS), from which the extended filesystem family borrowed its layout.
Recovering from a corrupted superblock
The superblock is read at mount time: if it's damaged, and so invalid, you'll get a "corrupted superblock" message.
If this happens with a filesystem of the EXT family, you can get information on the superblock backups using the dumpe2fs utility, as in the following example:
sudo dumpe2fs /dev/sda1 | grep -i superblock
which checks for backups of the superblock of the filesystem stored in the /dev/sda1 disk partition.
The output looks as follows:
dumpe2fs 1.44.3 (10-July-2018)
Primary superblock at 1, Group descriptors at 2-2
Backup superblock at 8193, Group descriptors at 8194-8194
from the output we see that there's a backup of the superblock at 8193: this means that we can use it to recover the filesystem if necessary; the command to use would be:
sudo e2fsck -b 8193 /dev/sda1
On critical filesystems it is wise to record the location of the backup superblocks while everything is still healthy, for example: dumpe2fs /dev/sda1 | grep -i superblock > somewhere.txt
Life with XFS is even easier: xfs_repair automatically checks the primary and backups superblocks and restores a corrupted primary superblock if necessary.
However, for the sake of completeness, let's have a look at the superblock of the XFS filesystem too: as previously told, when using XFS, copies of the superblock are stored at the beginning of each Allocation Group.
In the following example we are launching xfs_db utility to check the XFS filesystem stored into /dev/sda1 partition: type the following statement
xfs_db /dev/sda1
now type sb to focus on the primary superblock, or sb followed by the number of the Allocation Group to use a backup copy of the superblock.
For example, to select the superblock in Allocation Group 1, type:
xfs_db> sb 1
you can now print superblock information as follows:
xfs_db> p
the following box contains a snippet with some sample output:
magicnum = 0x58465342
blocksize = 4096
dblocks = 5120
rblocks = 0
rextents = 0
uuid = 5c836a30-db4e-41d4-a704-466d37857c82
press the 'q' key to exit the xfs_db utility. There is really a lot of information in there: among it there is the root inode number:
rootino = 128
this value is the same as the one reported by the ls command on the mount point of the given filesystem: for example, if /dev/sda1 is mounted on /, the command
ls -di /
produces the following output:
128 /
Footnotes
Here ends our guided tour of POSIX filesystems: I hope you enjoyed it, and that you agree it is nicer to have a bit of understanding of how a filesystem works rather than simply knowing how to operate it.
Michael Paoli says:
--sparse, etc. - looks like you probably input two dashes (--) and automangle unhelpfully converted them to a single em dash.
Marco Antonio Carcano says:
Thank you for reporting it Michael: I missed it. However it seems even trickier than that: I had a look at the page source and both of the dashes are there. I suspect there’s something weird within the CSS, although I’m not a CSS expert. I’ll try to investigate.
therainisme says:
This is the clearest blog I’ve ever seen! Thanks!
Marco Antonio Carcano says:
Hi, glad to know you like the posts – the mission of this blog is providing professional grade contents in a way that must be clear for everybody. I really need comments: it’s the only way to know how good or bad and interesting or pointless my posts are. Cheers.
Pedro says:
Nice articles.
Good.
Orphaned inodes are inodes that are still marked as used and bound to data blocks on the disk, but with no more directory entries linked to them.
On the EXT filesystem the fsck utility recovers them and puts them into the lost+found directory. But there are situations when you cannot shut down a system neither unmount the filesystem to run fsck so to perform the recovery.
This is Unix classic behavior in action.
You can keep writing on file descriptor after the unlink(2).
The inode is freed but as the writing continues more data blocks get assigned.
If this is a kind of never ending writing process unless you have limits you can even exhaust the file system data blocks.
But no problem, once you close the file descriptor or end the process all data blocks get back to free again.
Take a look at the old classic book by Maurice Bach “The Unix Operating System”.
It describes fsck and this Unix behavior.
By the way, there are cases where you are unable to shut down a process; even SIGKILL does nothing.
This means they are in an UNINTERRUPTIBLE SLEEP. It happens sometimes, because of bad C programming or
because the coder did not want any break.
Even a system shutdown takes more time to deal with this.
The proc file system first appeared on Unix System V, before Linux.
Linux took it on
Unnamed pipes (ls | more) allocate an inode during their short lifespan.
On the root of any filesystem . and .. are of course on the same inode.
Hard link on directories would cause a potential for a never ending loop so
the kernel does not allow it. Classic Unix again.
Hard links could have been patented by Thompson and Ritchie.
By the way, the internals of open/write/close on directories should always be done in an idempotent and synchronous way, because if something goes wrong you can end up losing everything beneath it. A reminder for the old SCO that did this async with a potential for catastrophe.
Ext file system is a heritage from the BSD Fast File System.
Superblock label and block groups (aligned to cylinders groups) are original
from the BSD FFS.
On critical file systems keep a backup of the info about where the backup superblocks are:
dumpe2fs /dev/sda1 | grep -i superblock > somewhere.txt
Marco Antonio Carcano says:
Thanks a lot for all the notes, Pedro: they’re the cherry on the top. I’m really pleased to see contribs like this.
Pedro says:
Hi Marco
I should have written a bit more.
“If this is a kind of never ending writing process, unless you have limits you can even exhaust the file system data blocks… with no easy clue as to where the data blocks are being used. You see block after block being vanished from nowhere???”
Once you close the file descriptor or end the process they go back to free blocks and suddenly all that “lost” space
“magically reappears”.
Daisuke says:
I found `cp sparse/disk2.img backup/disk4.img` generates a little bit different (and interesting) sizes than your example: 5.7M on Ubuntu 20.04, 4.6M on Ubuntu 18.04, not 16M.
`cp` assumes `--sparse=auto`, whose algorithm likely depends on the implementation.
Anyway, thank you for great post. It is invaluable for me.