Although it is a boring task, comparing files is something IT professionals have to deal with from time to time. There are many reasons for it:
- verify whether a file has been corrupted
- verify whether a file has been tampered with
- compare two versions of a configuration file to see where they differ: this happens quite often when an application stops working as it should after a configuration change and you have to work out why
- generate a patch that can be used to move back and forth between the current and the previous version of the same files
and so on.
This post explains how to deal with these needs on Linux using the tools provided by the Linux distribution.
Checking the integrity of files - checksums
A checksum is a fixed-size binary string derived from a block of data. Comparing the pre-computed checksum of a file, stored somewhere (for example in another file), with the checksum computed on the fly is a very handy way to verify whether anything has changed in the file.
The most straightforward use of checksums is to verify the integrity of a file, for example after downloading each part of a multi-part archive and recombining them.
This is the reason why it is very common to see files published along with their checksum files.
Look for example at the download page of the Gitea project:
the files with the ".sha256" suffix are the checksum files: from the suffix we can guess that they have been generated using the SHA-256 algorithm.
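To make the idea concrete, here is a minimal Python sketch of "compute the checksum on the fly and compare it with the stored one". The function names (file_sha256, matches) are hypothetical helpers made up for this illustration; only the standard hashlib module is used.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the hexadecimal SHA-256 digest of the file at 'path'.

    The file is read in chunks, so large files do not need to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare the freshly computed digest with a previously stored one."""
    return file_sha256(path) == expected_hex.strip().lower()
```

Calling matches() with the path of a downloaded file and the digest published next to it returns True only when the two agree.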
Checksum Algorithms
There are several checksum algorithms: on an RPM-based distribution we can easily list the ones available on our system as follows:
rpm -ql coreutils | grep "[^/]sum" | cut -d '/' -f 4
the outcome on my system is:
b2sum
cksum
md5sum
sha1sum
sha224sum
sha256sum
sha384sum
sha512sum
a few notes about them:
- cksum generates the "classic" 32-bit CRC
- b2sum provides BLAKE2 checksums. BLAKE2 is a family of hash functions (BLAKE2s, BLAKE2b, BLAKE2sp, BLAKE2bp) designed to be faster than the widely known MD5 while remaining cryptographically strong; BLAKE2sp and BLAKE2bp give their best on systems with large vector units or multiple cores, since they parallelize the hashing of long messages. The b2sum tool shipped with coreutils computes BLAKE2b digests.
- SHA-1 produces a 160-bit digest
- SHA-2 is more modern than SHA-1 and is mostly used in its 256-bit flavour, but it also includes SHA-224, SHA-384 and SHA-512, which are variants of the same family with different digest lengths
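As a quick way to see the digest lengths mentioned above, the following sketch asks Python's hashlib for the size of each digest. Note the assumptions: it relies on these algorithms being enabled in your Python build, and zlib.crc32 is a 32-bit CRC but not the same variant printed by cksum, so the two values will not match.

```python
import hashlib
import zlib

data = b"The quick brown fox"


def digest_bits(name):
    """Return the digest length, in bits, of a hashlib algorithm."""
    return hashlib.new(name).digest_size * 8


for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512", "blake2b"):
    sample = hashlib.new(name, data).hexdigest()
    print(f"{name:8} {digest_bits(name):4} bits  {sample[:16]}...")

# A 32-bit CRC for comparison (a different CRC variant than the cksum tool)
crc = zlib.crc32(data) & 0xFFFFFFFF
print(f"crc32      32 bits  {crc:08x}")
```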
Let's see all of this in action: let's download an xz-compressed file containing the Gitea binary:
curl -L -o gitea-1.13.0-linux-386.xz https://dl.gitea.io/gitea/1.13.0/gitea-1.13.0-linux-386.xz
let's extract it from the xz archive:
xz --decompress gitea-1.13.0-linux-386.xz
The outcome is the "gitea-1.13.0-linux-386" file.
We can trust xz: it has its own integrity checks (CRC) to make sure that the extracted file is intact, but as a theoretical exercise, in order to learn how checksums work, let's pretend that we do not trust it and so we want to make sure that the "gitea-1.13.0-linux-386" file is not corrupted.
We can simply download the SHA256 checksum file of the Gitea binary as follows:
curl -L -o gitea-1.13.0-linux-386.sha256 https://dl.gitea.io/gitea/1.13.0/gitea-1.13.0-linux-386.sha256
let's have a look at the contents of the "gitea-1.13.0-linux-386.sha256" file:
cat gitea-1.13.0-linux-386.sha256
the output is as follows:
cb8d4fe1168926282512012abe85e879fbfd0ffe326536dddf884889ff73b915 gitea-1.13.0-linux-386
the value on the left is the checksum, whereas the value on the right is the name of the file it has been computed from.
Let's check the integrity of the "gitea-1.13.0-linux-386" file.
sha256sum -c gitea-1.13.0-linux-386.sha256
this command computes the checksum of the "gitea-1.13.0-linux-386" file and checks if it matches the one contained in the related line of the "gitea-1.13.0-linux-386.sha256" file.
The output is:
gitea-1.13.0-linux-386: OK
this is good, the extracted file is not corrupted.
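If you are curious about what the check involves, here is a rough Python sketch of the same logic: read each line of the checksum file, recompute the digest of the named file and compare. It is only an approximation written for this post (check_listing and file_digest are hypothetical names), not how sha256sum is actually implemented; like the real tool, it resolves relative file names against the current directory.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Hex digest of a file, read chunk by chunk."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_listing(listing_path, algorithm="sha256"):
    """Return a list of (file name, True/False), one per checksum line."""
    results = []
    with open(listing_path, "r", encoding="utf-8") as listing:
        for line in listing:
            if not line.strip():
                continue  # skip empty lines
            expected, name = line.rstrip("\n").split(maxsplit=1)
            name = name.lstrip("*")  # "*" marks binary mode in some listings
            try:
                ok = file_digest(name, algorithm) == expected.lower()
            except OSError:
                ok = False  # a missing or unreadable file counts as a failure
            results.append((name, ok))
    return results
```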
Checking that files have not been tampered with
Besides the straightforward use of verifying file integrity (for example after recombining a file from a downloaded multi-part archive), checksums are a powerful ally in security matters, since they let you quickly identify tampered files on a system: the only requirement is having a list with the original checksums of the files whose integrity you want to monitor.
The checksum file we have just seen contains the checksum of only one file, but a checksum file can also be used to verify the integrity of several files.
As an example, let's create an MD5 checksum file with the checksums of the files stored beneath the "/etc/ssh" directory:
sudo md5sum /etc/ssh/* > /tmp/ssh.md5
we need sudo because access to the "/etc/ssh" directory is restricted.
Let's verify them:
sudo md5sum -c /tmp/ssh.md5
the output is:
/etc/ssh/moduli: OK
/etc/ssh/ssh_config: OK
/etc/ssh/sshd_config: OK
/etc/ssh/ssh_host_ecdsa_key: OK
/etc/ssh/ssh_host_ecdsa_key.pub: OK
/etc/ssh/ssh_host_ed25519_key: OK
/etc/ssh/ssh_host_ed25519_key.pub: OK
/etc/ssh/ssh_host_rsa_key: OK
/etc/ssh/ssh_host_rsa_key.pub: OK
But what is happening under the hood? Let's have a look at the contents of the checksum file:
cat /tmp/ssh.md5
the output is:
6fe064066e7fae1cda47dc6c718217da /etc/ssh/moduli
36276da2f6301b771d8a39fbcc620101 /etc/ssh/ssh_config
183229eb4bfc3f45c947740abea9ac42 /etc/ssh/sshd_config
5204bda7de31cbde773e5048fafb7a04 /etc/ssh/ssh_host_ecdsa_key
c80b9fe0b57d7a3c89bd926ee6efdae2 /etc/ssh/ssh_host_ecdsa_key.pub
d465c66694537b7356a4163905792478 /etc/ssh/ssh_host_ed25519_key
ba30ec40787fe6ae3783d28e30859022 /etc/ssh/ssh_host_ed25519_key.pub
d711d01f43aac78754d7b4ebc50152a8 /etc/ssh/ssh_host_rsa_key
21ffa77cab9d66af80672dc9dc039bfd /etc/ssh/ssh_host_rsa_key.pub
so it has the same format as the checksum file we downloaded from the Gitea website, but this time it contains a list of checksums, one per row, each followed by the path of the file it refers to.
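The same kind of listing can be produced with a few lines of Python. This is just a sketch under some assumptions: build_manifest is a name made up for this example, only regular files directly inside the directory are considered, and each file is read into memory at once, which is fine for small configuration files.

```python
import hashlib
import os


def build_manifest(directory, algorithm="md5"):
    """Return 'checksum  path' lines for the regular files in a directory."""
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-regular entries
        with open(path, "rb") as handle:
            digest = hashlib.new(algorithm, handle.read()).hexdigest()
        lines.append(f"{digest}  {path}")
    return lines
```

Writing the returned lines to a file, one per row, gives a listing in the same layout as the one shown above.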
You may be tempted to think that you have found a way to find out which files have been tampered with by evil people... not quite. You are still missing the most important thing: evil people can simply regenerate the file with the list of checksums, replacing the checksums of the original files with the checksums of the tampered ones.
When dealing with security, the pillar is trust, and we are still missing trust in the checksum file itself. We can easily gain it by digitally signing the checksum file with our private key: this way, before verifying the checksums, we verify the signature of the checksum file. Evil people are not able to sign the checksum file with our private key, at least as long as we keep it safe and sound.
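To illustrate the "verify the checksum file first, then trust its contents" order, here is a small Python sketch. Be aware that it swaps in a simpler technique than the digital signature described above: it uses an HMAC, that is a keyed hash based on a shared secret, instead of a private/public key pair. The function names and the key are hypothetical, and this is not a replacement for a real signature.

```python
import hashlib
import hmac


def seal(manifest_bytes, secret_key):
    """Return a keyed tag (HMAC-SHA256) for the contents of a checksum file."""
    return hmac.new(secret_key, manifest_bytes, hashlib.sha256).hexdigest()


def is_untouched(manifest_bytes, secret_key, expected_tag):
    """True only if the checksum file still matches the tag computed earlier."""
    return hmac.compare_digest(seal(manifest_bytes, secret_key), expected_tag)
```

Whoever regenerates the checksum list without knowing the key cannot produce a matching tag, which is the same property we rely on when signing with a private key.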
Checking file differences
Another very common need is checking the differences between two versions of the same file.
As an example of how to deal with this use case, let's modify the SSH configuration to disable login as the root user. We can achieve this with a simple sed one-liner:
sudo sed -i.bak -r 's/^[ ]*PermitRootLogin[ ]+.*/PermitRootLogin no/' /etc/ssh/sshd_config
we must of course reload the SSH service to apply it:
sudo systemctl reload sshd
Because of the -i option with the ".bak" suffix, the sed command above not only modifies the target file, but also creates a backup copy of the original version in the same directory, with ".bak" appended to the file name.
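For comparison, this is roughly what that sed one-liner does, sketched in Python. The helper name disable_root_login is made up for this example, and there is a small difference in behavior: sed writes a new file and renames it, whereas this sketch copies the original to the backup and then rewrites the file in place.

```python
import re
import shutil

# Same pattern as the sed expression, applied line by line thanks to MULTILINE
PATTERN = re.compile(r"^[ ]*PermitRootLogin[ ]+.*$", re.MULTILINE)


def disable_root_login(config_path, backup_suffix=".bak"):
    """Keep a backup copy, then force 'PermitRootLogin no' in the file.

    Returns True if the file contents changed.
    """
    shutil.copy2(config_path, config_path + backup_suffix)
    with open(config_path, "r", encoding="utf-8") as handle:
        original = handle.read()
    updated = PATTERN.sub("PermitRootLogin no", original)
    with open(config_path, "w", encoding="utf-8") as handle:
        handle.write(updated)
    return updated != original
```

As with the sed version, commented-out lines such as "#PermitRootLogin ..." are left alone, because the pattern only allows spaces before the keyword.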
We can use diff to compare them to see the difference:
sudo diff /etc/ssh/sshd_config /etc/ssh/sshd_config.bak
the output is:
43c43
< PermitRootLogin no
---
> PermitRootLogin yes
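"43c43" means that line 43 of the first file has to be changed to obtain line 43 of the second one; the lines starting with "<" come from the first file, those starting with ">" from the second. This confirms that the only difference between the original and the new file is the "PermitRootLogin" modification.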
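To get a feeling for how this kind of output is built, here is a small Python sketch that produces a similar report using difflib from the standard library. It only approximates the format for illustration purposes (normal_diff is a hypothetical helper, and difflib does not necessarily choose the same changes diff would on complex inputs).

```python
import difflib


def _span(start, end):
    """1-based 'n' or 'n,m' for the half-open range [start, end)."""
    return str(start + 1) if end - start == 1 else f"{start + 1},{end}"


def normal_diff(first, second):
    """Return lines resembling the default output of diff for two line lists."""
    out = []
    matcher = difflib.SequenceMatcher(a=first, b=second, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace":
            out.append(f"{_span(i1, i2)}c{_span(j1, j2)}")
        elif tag == "delete":
            out.append(f"{_span(i1, i2)}d{j1}")
        else:  # insert
            out.append(f"{i1}a{_span(j1, j2)}")
        out.extend(f"< {line}" for line in first[i1:i2])
        if tag == "replace":
            out.append("---")
        out.extend(f"> {line}" for line in second[j1:j2])
    return out


modified = ["#LoginGraceTime 2m", "PermitRootLogin no", "#StrictModes yes"]
backup = ["#LoginGraceTime 2m", "PermitRootLogin yes", "#StrictModes yes"]
print("\n".join(normal_diff(modified, backup)))
```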
What about the MD5 checksums we computed earlier? Let's check them:
sudo md5sum -c /tmp/ssh.md5
the output is:
/etc/ssh/moduli: OK
/etc/ssh/ssh_config: OK
/etc/ssh/sshd_config: FAILED
/etc/ssh/ssh_host_ecdsa_key: OK
/etc/ssh/ssh_host_ecdsa_key.pub: OK
/etc/ssh/ssh_host_ed25519_key: OK
/etc/ssh/ssh_host_ed25519_key.pub: OK
/etc/ssh/ssh_host_rsa_key: OK
/etc/ssh/ssh_host_rsa_key.pub: OK
md5sum: WARNING: 1 computed checksum did NOT match
everything as expected:
- look at the line "/etc/ssh/sshd_config: FAILED"
- look at the last line with the summary report: "1 computed checksum did NOT match"
If we prefer, we can even see the differences by putting the contents side by side as follows:
sudo diff -y /etc/ssh/sshd_config /etc/ssh/sshd_config.bak
it prints both files side by side and highlights the differences by putting a pipe character (|) between the modified lines; look at the following snippet of the output of the command - please note that most of the lines have been truncated to improve readability:
# $OpenBSD: sshd_config,v 1.103 2018/04/09      # $OpenBSD: sshd_config,v 1.103 2018/04/09 20:41:22
# This is the sshd server system-wide           # This is the sshd server system-wide
# sshd_config(5) for more information.          # sshd_config(5) for more information.
…
#LoginGraceTime 2m                              #LoginGraceTime 2m
PermitRootLogin no                            | PermitRootLogin yes
#StrictModes yes                                #StrictModes yes
…
# Example of overriding settings on             # Example of overriding settings on
#Match User anoncvs                             #Match User anoncvs
# X11Forwarding no                              # X11Forwarding no
# AllowTcpForwarding no                         # AllowTcpForwarding no
# PermitTTY no                                  # PermitTTY no
# ForceCommand cvs server                       # ForceCommand cvs server
Checking directory differences
The diff command line utility is also handy for looking at the differences between the contents of two directories, for example:
diff --brief -r /tmp/foo /tmp/bar
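by default, a file that exists in only one of the two directories is reported with an "Only in" line; by specifying the -N option such a file is treated as if it were present but empty in the other directory, so it is reported as differing: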
diff --brief -Nr /tmp/foo /tmp/bar
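A similar report can be obtained in Python with the filecmp module of the standard library. The sketch below is an approximation written for this post (brief_diff is a hypothetical name): it mimics the messages of "diff --brief -r" without -N, and note that filecmp.dircmp skips a few well-known names such as "CVS" and "RCS" by default.

```python
import filecmp
import os


def brief_diff(left, right):
    """Return messages similar to 'diff --brief -r left right'."""
    messages = []
    comparison = filecmp.dircmp(left, right)
    for name in sorted(comparison.left_only):
        messages.append(f"Only in {left}: {name}")
    for name in sorted(comparison.right_only):
        messages.append(f"Only in {right}: {name}")
    # shallow=False forces a comparison of the actual file contents
    _, mismatch, errors = filecmp.cmpfiles(
        left, right, comparison.common_files, shallow=False)
    for name in sorted(mismatch + errors):
        left_path = os.path.join(left, name)
        right_path = os.path.join(right, name)
        messages.append(f"Files {left_path} and {right_path} differ")
    # recurse into the subdirectories that exist on both sides
    for name in sorted(comparison.common_dirs):
        messages.extend(
            brief_diff(os.path.join(left, name), os.path.join(right, name)))
    return messages
```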
Creating A Patch
The diff command can also be used to create a patch file: a file containing only the differences between the compared files, which can be used to turn one version into the other, for example to restore a file to its previous version.
As an example, let's create a patch with the differences between the "/etc/ssh/sshd_config" file we just modified and its backup copy "/etc/ssh/sshd_config.bak":
sudo diff -u /etc/ssh/sshd_config /etc/ssh/sshd_config.bak > /tmp/sshd_config.patch
we need sudo only because these are files with restricted access.
Let's remove the backup file:
sudo rm -f /etc/ssh/sshd_config.bak
Let's see the contents of the "/tmp/sshd_config.patch" patch file:
cat /tmp/sshd_config.patch
the output is:
--- /etc/ssh/sshd_config 2021-12-08 08:07:13.351410873 +0000
+++ /etc/ssh/sshd_config.bak 2021-12-08 08:03:19.259603482 +0000
@@ -40,7 +40,7 @@
# Authentication:
#LoginGraceTime 2m
-PermitRootLogin no
+PermitRootLogin yes
#StrictModes yes
#MaxAuthTries 6
#MaxSessions 10
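When reading a unified diff, keep in mind that the lines prefixed with "-" belong to the first file given to diff (the one in the "---" header) and the lines prefixed with "+" to the second (the "+++" header). The following Python sketch shows it using difflib from the standard library; the file names are only labels here and the three-line contents are a made-up excerpt.

```python
import difflib

# First file: the modified configuration; second file: the backup copy
current = ["#LoginGraceTime 2m\n", "PermitRootLogin no\n", "#StrictModes yes\n"]
backup = ["#LoginGraceTime 2m\n", "PermitRootLogin yes\n", "#StrictModes yes\n"]


def make_patch(from_lines, to_lines, from_name, to_name):
    """Return the lines of a unified diff turning from_lines into to_lines."""
    return list(difflib.unified_diff(
        from_lines, to_lines, fromfile=from_name, tofile=to_name))


patch_lines = make_patch(current, backup, "sshd_config", "sshd_config.bak")
print("".join(patch_lines), end="")
```

Applying such a patch to the first file turns it into the second one, which is exactly what we want when rolling back.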
let's try to roll it back to its previous version:
sudo patch -i /tmp/sshd_config.patch /etc/ssh/sshd_config
again, we need sudo only because it is a file with restricted access.
The output is:
patching file /etc/ssh/sshd_config
we can of course use diff and patch on more than just one file, generating multiple patch files that can be applied as a whole using find.
A hands-on example
diff and patch were very popular years ago, when the only available SCM tools were CVS and later SVN: they were very handy when you wanted to enhance or fix a bug in third-party software. You coded your modifications and, once satisfied, you created a patch and submitted it to the project owner for review, hoping it would somehow be added to the project.
Nowadays, with Git, repository forks and pull requests, things are much easier, but knowing how to deal with the old way may help if you run into very old projects.
As a hands-on example, let's suppose that we want to use the "fooapp" application, but we need to slightly modify some of its files to adapt it to our use case.
I stored the "python-fooproject" project, with the source of the "fooapp" application along with its libraries and all the other support files (such as the RPM spec file), on GitHub and created a release. Let's download the release as follows:
wget https://github.com/mac-grimoire/python-fooproject/archive/refs/tags/release/0.0.1.tar.gz
now let's extract the files from the tarball:
tar xfvz 0.0.1.tar.gz
Before attempting any modification, let's run it to see how it works:
pushd python-fooproject-release-0.0.1/src/bin
./fooapp.py
the output is as follows:
Print the list:
Name: RedHat, Enabled: True
Name: Suse, Enabled: True
Name: CentOS, Enabled: False
Print the list in ascending order:
Name: CentOS, Enabled: False
Name: RedHat, Enabled: True
Name: Suse, Enabled: True
Print the list - after the removal of CentOS:
Name: RedHat, Enabled: True
Name: Suse, Enabled: True
Name: Rocky Linux, Enabled: False
Print the list in descending order:
Name: Suse, Enabled: True
Name: Rocky Linux, Enabled: False
Name: RedHat, Enabled: True
so it basically prints a list of Linux distributions.
We want to improve it by adding an attribute that stores the version of the distribution too, so as to better identify each of them.
Let's get back to the old working directory:
popd
We begin by copying the original files into a new directory tree where we'll develop our modified version:
cp -dpR python-fooproject-release-0.0.1 python-fooproject-myfancy
change directory to "python-fooproject-myfancy":
pushd python-fooproject-myfancy
now we are at the root of the directory tree we are about to develop in: modify its files as shown in the following snippets:
File: src/carcano/foolist/foolistitem.py
__author__ = "Marco Antonio Carcano"
__version__ = '1.0.0'

import functools


@functools.total_ordering
class FoolistItem():
    """
    Object used as Item into Foolist class
    """
    _name = ""
    _version = ""
    _enabled = False
    _next = None

    def __init__(self, name, version, enabled=False):
        """
        Initialize an Item assigning a name and a version. You can optionally
        assign a value to the "enabled" boolean attribute, that defaults to False
        :param name: the name to assign to this FoolistItem Object
        :param version: the version to assign to this FoolistItem Object
        :param enabled: a flag to mark this FoolistItem Object as enabled
                        or not
        """
        self._name = name
        self._version = version
        self._enabled = enabled
        self._next = None

    def __str__(self):
        """
        Implements the representation of the Item
        :returns: pretty-print of this FoolistItem Object
        """
        return f"Name: {self._name}, Version: {self._version}, Enabled: {self._enabled}"

    def __repr__(self):
        """
        Implements the representation of the Item
        :returns: an unambiguous representation of this FoolistItem Object
        """
        return f"Id: {id(self)}, Name: {self._name}, Version: {self._version}, Enabled: {self._enabled}"

    @property
    def next(self):
        """
        a reference to the next Item when used into a list
        """
        return self._next

    @next.setter
    def next(self, node):
        self._next = node

    @property
    def name(self):
        """
        the name of this FoolistItem Object
        """
        return self._name

    @name.setter
    def name(self, name):
        self._name = name

    @property
    def version(self):
        """
        the version of this FoolistItem Object
        """
        return self._version

    @version.setter
    def version(self, version):
        self._version = version

    @property
    def enabled(self):
        """
        whether or not this FoolistItem Object is enabled
        """
        return self._enabled

    @enabled.setter
    def enabled(self, enabled):
        self._enabled = enabled

    def __lt__(self, other):
        """
        Lower_than comparison implementation
        """
        return self.name < other.name

    def __eq__(self, other):
        """
        Equal_to comparison implementation
        """
        if other is not None:
            return (self.name) == (other.name) and (self.version) == (other.version)
        else:
            return (self.name) is None
File: src/carcano/foolist/foolist.py
__author__ = "Marco Antonio Carcano"
__version__ = '1.0.0'

from .foolistitem import FoolistItem
import logging

# initialize logger to NullHandler
log = logging.getLogger(__name__)
log.addHandler(logging.NullHandler())


class Foolist():
    """
    Object used to implement a list of FoolistItem
    """
    _unique = False

    def __init__(self):
        """
        Initialize the list by setting its head to None
        """
        self.head = None
        self._unique = False

    @property
    def unique(self):
        """
        if true, set the list so to contain only unique items
        """
        return self._unique

    @unique.setter
    def unique(self, isunique):
        log.debug('Package: '+__name__+', setting unique='+str(isunique))
        self._unique = isunique

    def __iter__(self):
        """
        Iterate throughout the list
        """
        node = self.head
        while node is not None:
            yield node
            node = node._next
        # no explicit "raise StopIteration" here: since Python 3.7 (PEP 479)
        # raising it inside a generator is turned into a RuntimeError

    def append(self, name, version, enabled=False):
        """
        Append a FoolistItem to the list
        :param name: the name to assign to the FoolistItem
        :param version: the version to assign to the FoolistItem
        :param enabled: a flag to mark it as enabled or not
        """
        log.debug('Package: ' + __name__ +
                  ',Method: append, Params: name="' + name +
                  '", version="' + version + '", enabled: ' +
                  str(enabled))
        if self.head is None:
            self.head = FoolistItem(name, version, enabled)
            return
        if self._unique is True:
            tmp_node = FoolistItem(name, version, enabled)
        for current_node in self:
            if self._unique is True and current_node == tmp_node:
                log.debug('Package: ' + __name__ +
                          ',Method: append, Msg: skipping since FoolistItem is already present')
                return
        # current_node is now the last item of the list
        current_node.next = FoolistItem(name, version, enabled)

    def remove(self, name, version):
        """
        Removes an item from the list
        :param name: the name of the FoolistItem to remove
        :param version: the version of the FoolistItem to remove
        """
        log.debug('Package: '+__name__+', Method: remove, Params: name="'+name +
                  '", version="' + version + '"')
        for f in self:
            if f.name == name and f.version == version:
                if f == self.head:
                    self.head = f._next
                else:
                    hold._next = f._next
                del f
                break
            else:
                hold = f

    def __str__(self):
        """
        Implements the representation of the list
        """
        nodes = []
        for node in self:
            nodes.append(str(node))
        return str(nodes)

    def __repr__(self):
        """
        Implements the representation of the list
        """
        nodes = []
        for node in self:
            nodes.append(repr(node))
        return str(nodes)
File: src/bin/fooapp.py
#!/usr/bin/env python3
from carcano.foolist import *
import os
import logging
import logging.config

log_config_paths = [
    os.path.dirname(os.path.realpath(__file__))+'/logging.conf',
    '/etc/fooapp/logging.conf']
log_enabled = False
for log_config_file in log_config_paths:
    if os.path.isfile(log_config_file):
        log_enabled = True
        break

if log_enabled is True:
    logging.config.fileConfig(
        fname=log_config_file,
        disable_existing_loggers=False
    )
    logger = logging.getLogger(__name__)
    logging.info(__file__+': started')

os_list = Foolist()
os_list.unique = True
os_list.append('RedHat', '8.0', True)
os_list.append('Suse', '11.0', True)
os_list.append('CentOS', '8.0', False)
os_list.append('CentOS', '7.0', False)

print("Print the list:")
for os in os_list:
    print(os)

print("Print the list in ascending order by name:")
for os in sorted(os_list):
    print(os)

os_list.remove('CentOS', '8.0')
os_list.append('Rocky Linux', '8.0', False)

print("Print the list - after the removal of CentOS 8.0:")
for os in os_list:
    print(os)

print("Print the list in descending order by name:")
for os in sorted(os_list, reverse=True):
    print(os)

if log_enabled is True:
    logging.info(__file__+': finished')
File: src/test/test_foolist.py
#!/usr/bin/env python3
import unittest
from collections.abc import Iterable
from carcano.foolist import *


class Foolist(unittest.TestCase):
    os_list = Foolist()

    def testFoolistIsIterable(self):
        self.assertIsInstance(self.os_list, Iterable)

    def testFoolistAppend(self):
        self.os_list.append('RedHat', '8.0', True)
        self.os_list.append('CentOS', '7.0', False)
        self.os_list.append('Suse', '11.0', True)
        self.assertIn(FoolistItem('CentOS', '7.0', False), self.os_list)

    def testFoolistRemove(self):
        self.os_list.remove('CentOS', '7.0')
        self.assertNotIn(FoolistItem('CentOS', '7.0', False), self.os_list)


if __name__ == '__main__':
    unittest.main()
we have now finished modifying the original project; let's pop back to the original directory:
popd
since in the future we may want to create other patches besides this one, to be tidy we create a directory where we store the patch files:
mkdir -p patches
Let's have a look at the contents of our current working directory:
ls -d1 *
the output is as follows:
patches
python-fooproject-myfancy
python-fooproject-release-0.0.1
We are finally ready to create the patch - simply type the following command:
diff --no-dereference -uarN python-fooproject-release-0.0.1 python-fooproject-myfancy > patches/fancymod.patch
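I used a set of options that you can safely use in any circumstance, since they take care of new files, binary files and symlinks: -u produces a unified diff, -a treats all files as text, -r recurses into subdirectories, -N treats files missing from one tree as empty, and --no-dereference compares symbolic links themselves instead of following them.
Let's try the patch: we need the original project to apply it to. To do so, we have to remove the directory tree containing the files we just developed ("python-fooproject-myfancy") and recreate it with the original files: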
rm -rf python-fooproject-myfancy
cp -dpR python-fooproject-release-0.0.1 python-fooproject-myfancy
We are finally ready to apply the patch: let's change directory
cd python-fooproject-myfancy
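and apply the patch using the patch command; the -p1 option strips the first directory component from the file paths stored in the patch, so that they become relative to the current directory: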
patch -p1 < ../patches/fancymod.patch
the output is as follows:
patching file src/bin/fooapp.py
patching file src/carcano/foolist/foolistitem.py
patching file src/carcano/foolist/foolist.py
patching file src/test/test_foolist.py
as you can see, it prints each of the files that gets patched.
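The patch command above uses the -p1 option: the paths recorded in the patch start with the name of the top-level directory, and -p1 removes that first component so that the remaining path is looked up from the current directory. The following Python sketch mimics that path handling; strip_components is a hypothetical helper written for this post, and raising an error when the path has too few components is a choice of this sketch, not necessarily what patch does.

```python
import re


def strip_components(file_name, count):
    """Strip the smallest prefix containing 'count' slashes from a path.

    Runs of adjacent slashes are counted as a single one.
    """
    if count == 0:
        return file_name
    collapsed = re.sub(r"/+", "/", file_name)
    pieces = collapsed.split("/", count)
    if len(pieces) <= count:
        raise ValueError(f"{file_name!r} has fewer than {count} slashes")
    return pieces[-1]


print(strip_components("python-fooproject-myfancy/src/bin/fooapp.py", 1))
```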
Lastly, we can run "fooapp" to see how it works after applying the patch:
cd src/bin
./fooapp.py
the output is as follows:
Print the list:
Name: RedHat, Version: 8.0, Enabled: True
Name: Suse, Version: 11.0, Enabled: True
Name: CentOS, Version: 8.0, Enabled: False
Name: CentOS, Version: 7.0, Enabled: False
Print the list in ascending order by name:
Name: CentOS, Version: 8.0, Enabled: False
Name: CentOS, Version: 7.0, Enabled: False
Name: RedHat, Version: 8.0, Enabled: True
Name: Suse, Version: 11.0, Enabled: True
Print the list - after the removal of CentOS 8.0:
Name: RedHat, Version: 8.0, Enabled: True
Name: Suse, Version: 11.0, Enabled: True
Name: CentOS, Version: 7.0, Enabled: False
Name: Rocky Linux, Version: 8.0, Enabled: False
Print the list in descending order by name:
Name: Suse, Version: 11.0, Enabled: True
Name: Rocky Linux, Version: 8.0, Enabled: False
Name: RedHat, Version: 8.0, Enabled: True
Name: CentOS, Version: 7.0, Enabled: False
As you can see, the patch worked, and we now have a more detailed outcome that also takes into account the version of the Linux distribution.
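You may wonder why "CentOS 8.0" comes before "CentOS 7.0" in the sorted output: items are ordered by name only, and Python's sort is stable, so items with the same name keep the order in which they were inserted. Here is a stripped-down sketch of that behavior; Distro is a simplified stand-in written for this explanation, not the FoolistItem class itself.

```python
import functools


@functools.total_ordering
class Distro:
    """Ordered by name only, compared for equality by name and version."""

    def __init__(self, name, version):
        self.name = name
        self.version = version

    def __lt__(self, other):
        return self.name < other.name

    def __eq__(self, other):
        return (self.name, self.version) == (other.name, other.version)

    def __repr__(self):
        return f"{self.name} {self.version}"


distros = [Distro("RedHat", "8.0"), Distro("Suse", "11.0"),
           Distro("CentOS", "8.0"), Distro("CentOS", "7.0")]

# sorted() only relies on __lt__, and it preserves the original order of
# the items that compare as "not lower than" each other
ascending = [repr(d) for d in sorted(distros)]
descending = [repr(d) for d in sorted(distros, reverse=True)]
print(ascending)
print(descending)
```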
Footnotes
This is the end of this tutorial on verifying checksums and comparing different file versions: I hope you enjoyed it and that it helped you to better understand this topic.