Although it is a tedious task, comparing files is something IT professionals have to deal with from time to time. There are many reasons for this:

  • verify if a file has been corrupted
  • verify if a file has been tampered with
  • compare two versions of a configuration file to see where they differ - this happens quite often when an application stops working as it should after a configuration change and you have to work out why
  • generate a patch that can be used to go back and forth between the current and previous versions of the same files

and so on.

This post explains how to deal with these needs on Linux using the tools provided by the Linux distribution.

Checking the integrity of files - checksums

A checksum is a fixed-size binary string derived from a block of data. Comparing the pre-computed checksum of a file stored somewhere (for example in another file) with its checksum computed on the fly is a very handy way to verify if anything has changed in the file.
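
As a minimal sketch of that idea in Python (the function name "file_sha256" and the sample file are illustrative assumptions, not part of any tool mentioned in this post), the following computes the SHA-256 digest of a file in chunks and compares it with a previously stored value:

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 digest of a file as a hex string, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical sample file, created only for this demonstration
with tempfile.NamedTemporaryFile(delete=False) as sample:
    sample.write(b"hello\n")
    sample_path = sample.name

stored = file_sha256(sample_path)            # the "pre-computed" checksum
unchanged = file_sha256(sample_path) == stored

with open(sample_path, "ab") as handle:      # simulate a modification
    handle.write(b"x")
changed = file_sha256(sample_path) != stored

os.remove(sample_path)
print(unchanged, changed)
```

Reading in chunks keeps memory usage flat even for very large files, which is presumably why the command line tools work the same way.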

The straightforward usage of checksums is to verify the integrity of a file, for example after recombining a file from a multispan archive after downloading each of its parts.

This is the reason why it is very common to see files published along with their checksum files.

Look for example at the download page of the GiTea project: the files with the ".sha256" suffix are the checksum files - by the suffix we can guess that they have been generated using the SHA256 checksum algorithm.

Sure, this is a subliminal message to push you to try GiTea, a "Painless self-hosted GIT service" - I really love it.

Checksum Algorithms

There are several checksum algorithms: on an RPM-based system we can easily check which ones are available as follows:

rpm -ql coreutils | grep "[^/]sum" | cut -d '/' -f 4

the outcome on my system is:

b2sum
cksum
md5sum
sha1sum
sha224sum
sha256sum
sha384sum
sha512sum

a few notes about them:

  • cksum generates the "classic" 32-bit CRC
  • b2sum provides BLAKE2 checksums - BLAKE2 is a modern hash function designed to be faster than the broadly known MD5 while being far stronger. The BLAKE2 family includes BLAKE2s, BLAKE2b, BLAKE2sp and BLAKE2bp (the last two are parallel variants that give their best on systems with large vector units or multiple cores when hashing long messages); the b2sum shipped with coreutils computes BLAKE2b
  • SHA1 produces a 160-bit digest
  • SHA2 is more modern than SHA1 and is mostly used in its 256-bit flavour, but it also includes SHA224, SHA384 and SHA512, which are variants of the same family with different digest lengths

SHA256 and MD5 are certainly the most used checksums, but how likely is a collision when using them? The probability that two given, unrelated files accidentally produce the same SHA256 digest is 1/(2^256) = 8.64e-78, whereas for MD5 it is 1/(2^128) = 2.94e-39. These figures only cover accidental collisions: MD5 is cryptographically broken, since colliding files can be crafted on purpose, so do not rely on it when deliberate tampering is a concern. Mind also that SHA256 is generally slower than MD5.
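
The two figures above can be reproduced with a few lines of Python. The last two lines go one step further and are only a rough sketch: they use the standard birthday approximation n^2 / 2^(bits + 1) for the chance of at least one accidental collision among n random files, and the value of n is an arbitrary assumption:

```python
# Probability that two given, unrelated inputs share the same digest
p_sha256 = 1 / 2**256
p_md5 = 1 / 2**128
print(f"SHA256: {p_sha256:.2e}")
print(f"MD5:    {p_md5:.2e}")

# Rough chance of at least one accidental collision among n random files,
# using the approximation n^2 / 2^(bits + 1); n is an arbitrary assumption
n = 10**9
p_md5_many = n**2 / 2**129
p_sha256_many = n**2 / 2**257
print(f"MD5, {n} files:    {p_md5_many:.2e}")
print(f"SHA256, {n} files: {p_sha256_many:.2e}")
```

Even with a billion files the accidental collision risk stays negligible for both algorithms; the practical difference lies in resistance to deliberately crafted collisions.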

Let's see all of this in action by downloading an xz-compressed file containing the GiTea binary:

curl -L -o gitea-1.13.0-linux-386.xz https://dl.gitea.io/gitea/1.13.0/gitea-1.13.0-linux-386.xz

let's extract it from the xz archive:

xz --decompress gitea-1.13.0-linux-386.xz

The outcome is the "gitea-1.13.0-linux-386" file.

We can trust xz: it has its own integrity checks (CRC) to make sure that the extracted file is intact. But as a theoretical exercise, in order to learn how checksums work, let's pretend that we do not trust it and so we want to make sure that the "gitea-1.13.0-linux-386" file is not corrupted.

We can simply download the SHA256 checksum file of the GiTea binary itself as follows:

curl -L -o gitea-1.13.0-linux-386.sha256 https://dl.gitea.io/gitea/1.13.0/gitea-1.13.0-linux-386.sha256

let's have a look at the contents of the "gitea-1.13.0-linux-386.sha256" file:

cat gitea-1.13.0-linux-386.sha256

the output is as follows:

cb8d4fe1168926282512012abe85e879fbfd0ffe326536dddf884889ff73b915  gitea-1.13.0-linux-386

the value on the left is the checksum, whereas the value on the right is the file it has been computed from.

Let's check the integrity of the "gitea-1.13.0-linux-386" file.

sha256sum -c gitea-1.13.0-linux-386.sha256

this command computes the checksum of the "gitea-1.13.0-linux-386" file and checks if it matches the one contained in the related line of the "gitea-1.13.0-linux-386.sha256" file.

The output is:

gitea-1.13.0-linux-386: OK

this is good, the extracted file is not corrupted.
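
To make what happens during the check more concrete, here is a simplified Python sketch of the same logic. It is not how sha256sum is actually implemented: the function name "verify_checksum_file", its "base_dir" parameter and the sample files are assumptions made up for this illustration, and only the plain "digest, two spaces, name" layout is handled:

```python
import hashlib
import os
import tempfile


def verify_checksum_file(checksum_path, base_dir="."):
    """Return (file name, matches) pairs for a sha256sum-style checksum file.

    Simplified: only the plain "<hex digest>  <name>" layout is handled.
    """
    results = []
    with open(checksum_path, encoding="utf-8") as listing:
        for line in listing:
            if not line.strip():
                continue  # skip empty lines
            expected, name = line.rstrip("\n").split(maxsplit=1)
            name = name.lstrip("*")  # a leading '*' marks binary mode
            with open(os.path.join(base_dir, name), "rb") as handle:
                actual = hashlib.sha256(handle.read()).hexdigest()
            results.append((name, actual == expected.lower()))
    return results


# Hypothetical file and checksum file, created only for this demonstration
with tempfile.TemporaryDirectory() as workdir:
    payload = b"example payload"
    with open(os.path.join(workdir, "payload.bin"), "wb") as handle:
        handle.write(payload)
    checksum_path = os.path.join(workdir, "payload.bin.sha256")
    with open(checksum_path, "w", encoding="utf-8") as handle:
        handle.write(f"{hashlib.sha256(payload).hexdigest()}  payload.bin\n")
    results = verify_checksum_file(checksum_path, base_dir=workdir)

for name, matches in results:
    print(f"{name}: {'OK' if matches else 'FAILED'}")
```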

Do not confuse file integrity with tamper checking and non-repudiation: a checksum only ensures that the file is not corrupted, and verifying it means blindly trusting the contents of the checksum file. Anyone who tampers with the file can also publish a matching checksum file, so the verification command will report that everything is OK even if you are about to install a tampered package.

Checking that files have not been tampered with

Besides the straightforward use of verifying file integrity (for example after recombining a downloaded multispan archive), checksums are a powerful ally in security matters, since they let you quickly identify tampered files on a system: the only requirement is having a list with the original checksums of the files whose integrity you want to monitor.

The checksum file we have just seen contains the checksum of only one file, but a checksum file can also be used to verify the integrity of several files.

As an example let's create an MD5 checksum file with the checksums of the files stored beneath the "/etc/ssh" directory:

sudo md5sum /etc/ssh/* > /tmp/ssh.md5

we need sudo because some of the files in the "/etc/ssh" directory (the private host keys) are readable only by root.

Let's verify them:

sudo md5sum -c /tmp/ssh.md5

the output is:

/etc/ssh/moduli: OK
/etc/ssh/ssh_config: OK
/etc/ssh/sshd_config: OK
/etc/ssh/ssh_host_ecdsa_key: OK
/etc/ssh/ssh_host_ecdsa_key.pub: OK
/etc/ssh/ssh_host_ed25519_key: OK
/etc/ssh/ssh_host_ed25519_key.pub: OK
/etc/ssh/ssh_host_rsa_key: OK
/etc/ssh/ssh_host_rsa_key.pub: OK

But what is happening under the hood? Let's have a look at the contents of the checksum file:

cat /tmp/ssh.md5

the output is:

6fe064066e7fae1cda47dc6c718217da  /etc/ssh/moduli
36276da2f6301b771d8a39fbcc620101  /etc/ssh/ssh_config
183229eb4bfc3f45c947740abea9ac42  /etc/ssh/sshd_config
5204bda7de31cbde773e5048fafb7a04  /etc/ssh/ssh_host_ecdsa_key
c80b9fe0b57d7a3c89bd926ee6efdae2  /etc/ssh/ssh_host_ecdsa_key.pub
d465c66694537b7356a4163905792478  /etc/ssh/ssh_host_ed25519_key
ba30ec40787fe6ae3783d28e30859022  /etc/ssh/ssh_host_ed25519_key.pub
d711d01f43aac78754d7b4ebc50152a8  /etc/ssh/ssh_host_rsa_key
21ffa77cab9d66af80672dc9dc039bfd  /etc/ssh/ssh_host_rsa_key.pub

so it has the same format as the checksum file we downloaded from the GiTea website, but this time it contains a list of checksums, one per row, each followed by the path of the file it refers to, separated by whitespace (two spaces in the default md5sum output).
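
A listing in this format is easy to produce by hand. The following is only a sketch (the function name "checksum_lines" and the temporary sample files are assumptions made up for this example) that builds md5sum-style lines for a set of files:

```python
import hashlib
import os
import tempfile


def checksum_lines(paths, algorithm="md5"):
    """Return md5sum-style lines ("<hex digest>  <path>") for the given files."""
    lines = []
    for path in paths:
        with open(path, "rb") as handle:
            digest = hashlib.new(algorithm, handle.read()).hexdigest()
        lines.append(f"{digest}  {path}")
    return lines


# Hypothetical sample files, created only for this demonstration
with tempfile.TemporaryDirectory() as workdir:
    paths = []
    for name, content in (("first.conf", b"hello"), ("second.conf", b"world")):
        path = os.path.join(workdir, name)
        with open(path, "wb") as handle:
            handle.write(content)
        paths.append(path)
    manifest = checksum_lines(paths)

print("\n".join(manifest))
```

Passing a different algorithm name, for example "sha256", yields the equivalent of the corresponding *sum command.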

You may be tempted to think that you have found a way to find out which files have been tampered with by evil people ... not quite. You are still missing the most important thing: evil people can simply regenerate the list of checksums, replacing the checksums of the original files with those of the tampered ones.

When dealing with security, the pillar is trust, and we still have no reason to trust the checksum file ... a trust that, by the way, we can easily gain by digitally signing the checksum file with our private key: this way, before verifying the checksums, we verify the signature of the checksum file itself - evil people cannot sign the checksum file with our private key ... at least as long as we keep it safe and sound.
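
The principle can be illustrated with the Python standard library alone, with one big caveat: the sketch below uses an HMAC, a keyed hash, as a stand-in for a real digital signature. It is not the same thing (with an HMAC the verifier needs the same secret, whereas a signature made with a tool such as GnuPG is verified with the public key), but it shows why an attacker without the key cannot produce a valid tag for a regenerated checksum file. The key and the file contents are made-up values:

```python
import hashlib
import hmac

# Hypothetical secret; in a real setup this role is played by a private key
secret_key = b"keep-me-safe-and-sound"

checksum_file = b"183229eb4bfc3f45c947740abea9ac42  /etc/ssh/sshd_config\n"
tag = hmac.new(secret_key, checksum_file, hashlib.sha256).hexdigest()


def is_authentic(content, expected_tag, key):
    """Check that the content matches the tag produced with the given key."""
    actual = hmac.new(key, content, hashlib.sha256).hexdigest()
    return hmac.compare_digest(actual, expected_tag)


# The untouched checksum file verifies; a regenerated one does not
tampered = b"00000000000000000000000000000000  /etc/ssh/sshd_config\n"
genuine_ok = is_authentic(checksum_file, tag, secret_key)
tampered_ok = is_authentic(tampered, tag, secret_key)

# An attacker who regenerates the tag without the real key does not succeed
forged_tag = hmac.new(b"attacker-guess", tampered, hashlib.sha256).hexdigest()
forged_ok = is_authentic(tampered, forged_tag, secret_key)

print(genuine_ok, tampered_ok, forged_ok)
```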

If you are thinking of writing a script that monitors the checksums of config files to periodically check whether somebody has tampered with them, yes ... it sounds good ... so good that others have already done it, and much more. Do not waste your time: install and configure aide or, even better, install Wazuh (https://wazuh.com/).

Checking file differences

Another very common need is checking the differences between two versions of the same file.

As an example of how to deal with this use case, let's modify the SSH server configuration to disable login as the root user. We can achieve this with a simple sed one-liner:

sudo sed -i.bak -r 's/^[ ]*PermitRootLogin[ ]+.*/PermitRootLogin no/' /etc/ssh/sshd_config

we must of course reload the sshd service to apply it:

sudo systemctl reload sshd

Because of the -i option with the ".bak" suffix, the above sed command, besides modifying the target file in place, creates a backup copy of the original version in the same directory, with ".bak" appended to the file name.
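
For readers who want to see what the substitution does without touching a real configuration file, here is a small Python sketch that applies the same pattern to a made-up excerpt (the sample text is an assumption, not the actual contents of "/etc/ssh/sshd_config"):

```python
import re

# The same pattern used in the sed one-liner, applied line by line
pattern = re.compile(r"^[ ]*PermitRootLogin[ ]+.*$", re.MULTILINE)

# Made-up excerpt standing in for /etc/ssh/sshd_config
original = "#LoginGraceTime 2m\nPermitRootLogin yes\n#StrictModes yes\n"

backup = original                      # what sed keeps as sshd_config.bak
modified = pattern.sub("PermitRootLogin no", original)

# A commented-out directive does not match, so it is left untouched
commented = pattern.sub("PermitRootLogin no", "#PermitRootLogin yes\n")

print(modified, end="")
```

Note the last case: if the file only contains the commented-out default, the pattern matches nothing and the command leaves the file as it is.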

We can use diff to compare them to see the difference:

sudo diff /etc/ssh/sshd_config /etc/ssh/sshd_config.bak

the output is:

43c43
< PermitRootLogin no
---
> PermitRootLogin yes

this confirms that the only difference between the original and new file is the "PermitRootLogin" modification.

What about the MD5 checksums we computed earlier? Let's check them:

sudo md5sum -c /tmp/ssh.md5

the output is:

/etc/ssh/moduli: OK
/etc/ssh/ssh_config: OK
/etc/ssh/sshd_config: FAILED
/etc/ssh/ssh_host_ecdsa_key: OK
/etc/ssh/ssh_host_ecdsa_key.pub: OK
/etc/ssh/ssh_host_ed25519_key: OK
/etc/ssh/ssh_host_ed25519_key.pub: OK
/etc/ssh/ssh_host_rsa_key: OK
/etc/ssh/ssh_host_rsa_key.pub: OK
md5sum: WARNING: 1 computed checksum did NOT match

everything as expected:

  • look at the line "/etc/ssh/sshd_config: FAILED"
  • look at the last line with the summary report: "1 computed checksum did NOT match"

If we prefer, we can even see the differences by putting the contents side by side as follows:

sudo diff -y /etc/ssh/sshd_config /etc/ssh/sshd_config.bak

it prints both files, side by side, and highlights differences by putting a pipe character (|) between the modified lines; look at the following snippet of the output of the command - please note that most of the lines have been cut to improve readability:

# $OpenBSD: sshd_config,v 1.103 2018/04/09       # $OpenBSD: sshd_config,v 1.103 2018/04/09 20:41:22 
# This is the sshd server system-wide            # This is the sshd server system-wide
# sshd_config(5) for more information.           # sshd_config(5) for more information.
… 
#LoginGraceTime 2m                               #LoginGraceTime 2m
PermitRootLogin no                             | PermitRootLogin yes
#StrictModes yes                                 #StrictModes yes
… 
# Example of overriding settings on              # Example of overriding settings on 
#Match User anoncvs                              #Match User anoncvs
#       X11Forwarding no                         #        X11Forwarding no
#       AllowTcpForwarding no                    #        AllowTcpForwarding no
#       PermitTTY no                             #        PermitTTY no
#       ForceCommand cvs server                  #        ForceCommand cvs server

Checking directory differences

The diff command line utility can also be handy to look at differences in directory contents, for example:

diff --brief -r /tmp/foo /tmp/bar

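files that exist in only one of the two directories are reported as "Only in ..."; by specifying the -N option, a missing file is instead treated as an empty one, so it is reported as a differing file: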

diff --brief -Nr /tmp/foo /tmp/bar
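
A comparable report can be obtained in Python with the standard filecmp module. The sketch below builds two small throw-away directories (their names and contents are assumptions made up for the example) and lists what exists on one side only and what differs. Mind that dircmp compares files in "shallow" mode by default, relying on size and modification time before looking at contents, and that these attributes only describe the top-level directory:

```python
import filecmp
import os
import tempfile


def write(path, text):
    """Create a small text file, creating parent directories as needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as root:
    foo, bar = os.path.join(root, "foo"), os.path.join(root, "bar")
    write(os.path.join(foo, "same.txt"), "identical\n")
    write(os.path.join(bar, "same.txt"), "identical\n")
    write(os.path.join(foo, "changed.txt"), "old\n")
    write(os.path.join(bar, "changed.txt"), "new content\n")
    write(os.path.join(foo, "only-in-foo.txt"), "x\n")

    comparison = filecmp.dircmp(foo, bar)
    only_left = sorted(comparison.left_only)
    only_right = sorted(comparison.right_only)
    differing = sorted(comparison.diff_files)

print("Only in foo:", only_left)
print("Only in bar:", only_right)
print("Differ:", differing)
```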

Creating A Patch

The diff command can also be used to create a patch file: a patch file contains only the differences between the compared files and can be applied to one version of a file to obtain the other, for example to restore a file to its previous version.

As an example, let's create a patch with the differences between the "/etc/ssh/sshd_config" file we just modified and its backup copy "/etc/ssh/sshd_config.bak":

sudo diff -u /etc/ssh/sshd_config /etc/ssh/sshd_config.bak > /tmp/sshd_config.patch

we need sudo only because it is a file with restricted access.

Let's remove the backup file:

sudo rm -f /etc/ssh/sshd_config.bak

Let's see the contents of the "/tmp/sshd_config.patch" patch file:

cat /tmp/sshd_config.patch

the output is:

--- /etc/ssh/sshd_config 2021-12-08 08:07:13.351410873 +0000
+++ /etc/ssh/sshd_config.bak 2021-12-08 08:03:19.259603482 +0000
@@ -40,7 +40,7 @@
 # Authentication:

 #LoginGraceTime 2m
-PermitRootLogin no
+PermitRootLogin yes
 #StrictModes yes
 #MaxAuthTries 6
 #MaxSessions 10
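
The unified format is simple enough to be produced by the Python standard library too. In the following sketch the two lists are a made-up three-line excerpt of the two files (an assumption for brevity): lines prefixed with "-" belong only to the first input, lines prefixed with "+" belong only to the second one, and lines prefixed with a space are unchanged context:

```python
import difflib

# Made-up excerpts: the modified file first, its backup copy second
current = ["#LoginGraceTime 2m\n", "PermitRootLogin no\n", "#StrictModes yes\n"]
backup = ["#LoginGraceTime 2m\n", "PermitRootLogin yes\n", "#StrictModes yes\n"]

patch_text = "".join(
    difflib.unified_diff(
        current,
        backup,
        fromfile="/etc/ssh/sshd_config",
        tofile="/etc/ssh/sshd_config.bak",
    )
)
print(patch_text, end="")
```

Since the modified file is passed first, applying the resulting patch to it leads back to the backup version.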

let's try to roll it back to its previous version:

sudo patch -i /tmp/sshd_config.patch /etc/ssh/sshd_config

again, we need sudo only because it is a file with restricted access.

The output is:

patching file /etc/ssh/sshd_config
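
The idea of rebuilding one version from a set of differences can also be tried in Python. This is only an illustration of the concept, not an equivalent of the patch command: difflib.ndiff produces its own delta format, which carries every line of both inputs and cannot be fed to patch. The two lists are made-up excerpts:

```python
import difflib

# Made-up excerpts of the current and the previous version of the file
current = ["#LoginGraceTime 2m\n", "PermitRootLogin no\n", "#StrictModes yes\n"]
previous = ["#LoginGraceTime 2m\n", "PermitRootLogin yes\n", "#StrictModes yes\n"]

# ndiff produces a delta that contains enough information to rebuild both inputs
delta = list(difflib.ndiff(current, previous))

rebuilt_current = list(difflib.restore(delta, 1))    # first sequence
rebuilt_previous = list(difflib.restore(delta, 2))   # second sequence

print(rebuilt_previous == previous, rebuilt_current == current)
```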

we can of course use diff and patch on more than one file, generating multiple patch files that can be applied as a whole using find.

Please note that using diff and patch to patch files has nothing to do with version management - if you want version control you must use a Source Control Management tool such as Git. Patch files are only meant to let you apply a different flavor to a vanilla software.

A hands-on example

diff and patch were very popular years ago, when the only available SCMs were CVS and later SVN: these tools were very handy when you wanted to enhance or fix a bug in third-party software - you coded your changes and, once satisfied, you created a patch and submitted it to the project owner for review, hoping they would merge it into the project.

Nowadays with Git, repository forks and pull requests, things are much easier, but knowing how to deal with the old way may help if you come across very old projects.

As a hands-on example, let's suppose that we want to use the "fooapp" application, but we need to slightly modify some of its files to adapt it to our use case.

"fooapp" is a sample application I developed for a trilogy of posts dedicated to Python, with the aim of describing how to set up a full-featured Python project, how to pack everything up with setuptools and how to create an RPM package. I really recommend reading those posts because they really rock!

I stored the "python-fooproject" project, with the source of the "fooapp" application along with the libraries and all the other support files (such as the RPM spec file), on GitHub and created a release. Let's download the release as follows:

wget https://github.com/mac-grimoire/python-fooproject/archive/refs/tags/release/0.0.1.tar.gz

now let's extract the files from the tarball:

tar xfvz 0.0.1.tar.gz

Before attempting any modification, let's run it to see how it works:

pushd python-fooproject-release-0.0.1/src/bin
./fooapp.py

the output is as follows:

Print the list:
Name: RedHat, Enabled: True
Name: Suse, Enabled: True
Name: CentOS, Enabled: False
Print the list in ascending order:
Name: CentOS, Enabled: False
Name: RedHat, Enabled: True
Name: Suse, Enabled: True
Print the list - after the removal of CentOS:
Name: RedHat, Enabled: True
Name: Suse, Enabled: True
Name: Rocky Linux, Enabled: False
Print the list in descending order:
Name: Suse, Enabled: True
Name: Rocky Linux, Enabled: False
Name: RedHat, Enabled: True

so it basically prints a list of Linux distributions.

We want to improve it by adding an attribute that stores the version of the distribution too, so as to better identify each entry.

Let's get back to the old working directory:

popd

We begin by copying the original files into a new directory tree where we'll develop our modified version:

cp -dpR python-fooproject-release-0.0.1 python-fooproject-myfancy

change directory to "python-fooproject-myfancy":

pushd python-fooproject-myfancy

now we are at the root of the directory tree we are about to develop in: modify its files as shown in the following snippets:

File: src/carcano/foolist/foolistitem.py

__author__ = "Marco Antonio Carcano"
__version__ = '1.0.0'

import functools


@functools.total_ordering
class FoolistItem():
    """
    Object used as Item into Foolist class
    """
    _name = ""
    _version = ""
    _enabled = False
    _next = None

    def __init__(self, name, version, enabled=False):
        """
        Initialize an Item assigning a name and a version. You can optionally
        assign a value to the "enabled" boolean attribute, that defaults to False

        :param name: the name to assign to this FoolistItem Object
        :param version: the version to assign to this FoolistItem Object
        :param enabled: a flag to mark this FoolistItem Object as enabled
                        or not
        """
        self._name = name
        self._version = version
        self._enabled = enabled
        self._next = None

    def __str__(self):
        """
        Implements the representation of the Item

        :returns: pretty-print of this FoolistItem Object
        """
        return f"Name: {self._name}, Version: {self._version}, Enabled: {self._enabled}"

    def __repr__(self):
        """
        Implements the representation of the Item

        :returns: an unambiguous representation of this FoolistItem Object
        """
        return f"Id: {id(self)}, Name: {self._name}, Version: {self._version}, Enabled: {self._enabled}"

    @property
    def next(self):
        """
        a reference to the next Item when used into a list
        """
        return self._next

    @next.setter
    def next(self, node):
        self._next = node

    @property
    def name(self):
        """
        the name of this FoolistItem Object
        """
        return self._name

    @name.setter
    def name(self, name):
        self._name = name

    @property
    def version(self):
        """
        the version of this FoolistItem Object
        """
        return self._version

    @version.setter
    def version(self, version):
        self._version = version

    @property
    def enabled(self):
        """
        whether or not this FoolistItem Object is enabled
        """
        return self._enabled

    @enabled.setter
    def enabled(self, enabled):
        self._enabled = enabled

    def __lt__(self, other):
        """
        Lower_than comparison implementation
        """
        return self.name < other.name

    def __eq__(self, other):
        """
        Equal_to comparison implementation
        """
        if other is not None:
            return (self.name) == (other.name) and (self.version) == (other.version)
        else:
            return (self.name) is None
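
A side note on the functools.total_ordering decorator used above: it derives the remaining comparison operators from the two that the class defines. The following stand-alone sketch uses a hypothetical "Distro" class, reduced to the bare minimum and not part of the project, to show the effect:

```python
import functools


@functools.total_ordering
class Distro:
    """Hypothetical stand-in for FoolistItem, reduced to name and version."""

    def __init__(self, name, version):
        self.name = name
        self.version = version

    def __lt__(self, other):
        return self.name < other.name

    def __eq__(self, other):
        return (self.name, self.version) == (other.name, other.version)

    def __repr__(self):
        return f"{self.name} {self.version}"


items = [Distro("Suse", "11.0"), Distro("CentOS", "8.0"), Distro("CentOS", "7.0")]

ordered = sorted(items)                                       # relies on __lt__
derived = Distro("RedHat", "8.0") >= Distro("CentOS", "8.0")  # built from __lt__
same = Distro("CentOS", "8.0") == Distro("CentOS", "7.0")

print(ordered, derived, same)
```

Since the ordering looks only at the name while equality also looks at the version, two entries with the same name keep their insertion order when sorted.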

File: src/carcano/foolist/foolist.py

__author__ = "Marco Antonio Carcano"
__version__ = '1.0.0'

from .foolistitem import FoolistItem

import logging
"""
initialize logger to NullHandler
"""
log = logging.getLogger(__name__)
log.addHandler(logging.NullHandler())


class Foolist():
    """
    Object used to implement a list of FoolistItem
    """
    _unique = False

    def __init__(self):
        """
        Initialize the list by setting its head to None
        """
        self.head = None
        self._unique = False

    @property
    def unique(self):
        """
        if true, set the list so to contain only unique items
        """
        return self._unique

    @unique.setter
    def unique(self, isunique):
        log.debug('Package: '+__name__+', setting unique='+str(isunique))
        self._unique = isunique

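    def __iter__(self):
        """
        Iterate throughout the list
        """
        node = self.head
        while node is not None:
            yield node
            node = node._next
        # the generator simply ends here: an explicit "raise StopIteration"
        # would be turned into a RuntimeError since Python 3.7 (PEP 479)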

    def append(self, name, version, enabled=False):
        """
        Append a FoolistItem to the list

        :param name: the name to assign to the FoolistItem
        :param version: the version to assign to the FoolistItem
        :param enabled: a flag to mark it as enabled or not
        """
        log.debug('Package: ' + __name__ +
                ',Method: append, Params: name="' + name +
                '", version="' + version + '", enabled: ' +
                str(enabled))
        if self.head is None:
            self.head = FoolistItem(name, version, enabled)
            return
        if self._unique is True:
            tmp_node = FoolistItem(name, version, enabled)
        for current_node in self:
            if self._unique is True and current_node == tmp_node:
                log.debug('Package: ' + __name__ +
                        ',Method: append, Msg: skipping since FoolistItem is already present')
                return
            pass
        current_node.next = FoolistItem(name, version, enabled)

    def remove(self, name, version):
        """
        Removes an item from the list

        :param name: the name of the FoolistItem to remove
        :param version: the version of the FoolistItem to remove
        """
        log.debug('Package: '+__name__+', Method: remove, Params: name="'+name +
                '", version="' + version + '"')
        for f in self:
            if f.name == name and f.version == version:
                if f == self.head:
                    self.head = f._next
                else:
                    hold._next = f._next
                del f
                break
            else:
                hold = f

    def __str__(self):
        """
        Implements the representation of the list
        """
        nodes = []
        for node in self:
            nodes.append(str(node))
        return str(nodes)


    def __repr__(self):
        """
        Implements the representation of the list
        """
        nodes = []
        for node in self:
            nodes.append(repr(node))
        return str(nodes)
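
The iteration pattern used by this class can be tried in isolation. The sketch below uses hypothetical "Node" and "LinkedList" classes, far simpler than the real ones, to show that a generator-based __iter__ is all that list(), sorted() and the "in" operator need:

```python
class Node:
    """Hypothetical minimal node: a value plus a link to the next node."""

    def __init__(self, value):
        self.value = value
        self.next = None


class LinkedList:
    """Hypothetical minimal singly linked list with a generator-based iterator."""

    def __init__(self):
        self.head = None

    def append(self, value):
        new_node = Node(value)
        if self.head is None:
            self.head = new_node
            return
        node = self.head
        while node.next is not None:
            node = node.next
        node.next = new_node

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.value
            node = node.next
        # Falling off the end finishes the generator; since Python 3.7 an
        # explicit "raise StopIteration" here would surface as a RuntimeError


distros = LinkedList()
for name in ("RedHat", "Suse", "CentOS"):
    distros.append(name)

as_list = list(distros)
in_order = sorted(distros)
found = "Suse" in distros
print(as_list, in_order, found)
```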

File: src/bin/fooapp.py

#!/usr/bin/env python3
from carcano.foolist import *

import os
import logging
import logging.config

log_config_paths = [
    os.path.dirname(os.path.realpath(__file__))+'/logging.conf',
    '/etc/fooapp/logging.conf']

log_enabled = False

for log_config_file in log_config_paths:
    if os.path.isfile(log_config_file):
        log_enabled = True
        break
if log_enabled is True:
    logging.config.fileConfig(
        fname=log_config_file,
        disable_existing_loggers=False
    )
    logger = logging.getLogger(__name__)
    logging.info(__file__+': started')

os_list = Foolist()
os_list.unique = True
os_list.append('RedHat', '8.0', True)
os_list.append('Suse', '11.0', True)
os_list.append('CentOS', '8.0', False)
os_list.append('CentOS', '7.0', False)
print("Print the list:")
for os in os_list:
    print(os)
print("Print the list in ascending order by name:")
for os in sorted(os_list):
    print(os)
os_list.remove('CentOS', '8.0')
os_list.append('Rocky Linux', '8.0', False)
print("Print the list - after the removal of CentOS 8.0:")
for os in os_list:
    print(os)

print("Print the list in descending order by name:")
for os in sorted(os_list, reverse=True):
    print(os)
if log_enabled is True:
    logging.info(__file__+': finished')

File: src/test/test_foolist.py

#!/usr/bin/env python3
import unittest
from collections.abc import Iterable
from carcano.foolist import *


class FoolistTest(unittest.TestCase):
    os_list = Foolist()

    def testFoolistIsIterable(self):
        self.assertIsInstance(self.os_list, Iterable)

    def testFoolistAppend(self):
        self.os_list.append('RedHat', '8.0', True)
        self.os_list.append('CentOS', '7.0', False)
        self.os_list.append('Suse', '11.0', True)
        self.assertIn(FoolistItem('CentOS', '7.0', False), self.os_list)

    def testFoolistRemove(self):
        self.os_list.remove('CentOS', '7.0')
        self.assertNotIn(FoolistItem('CentOS', '7.0', False), self.os_list)


if __name__ == '__main__':
    unittest.main()

we have now finished modifying the original project; let's pop back to the original directory:

popd

in the future we may want to create other patches besides this one, so to be tidy let's create a directory in which to store the patch files:

mkdir -p patches

Let's have a look at the contents of our current working directory:

ls -d1 *

the output is as follows:

patches
python-fooproject-myfancy
python-fooproject-release-0.0.1

We are finally ready to create the patch - simply type the following command:

diff --no-dereference -uarN python-fooproject-release-0.0.1 python-fooproject-myfancy > patches/fancymod.patch

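I used a set of options that you can safely use in any circumstance, since they take care of new files (-N), of binary files (-a treats every file as text) and of symlinks (--no-dereference), while recursing into subdirectories (-r) and producing the unified format (-u).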
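
To get a feeling of what such a recursive comparison involves, here is a rough Python sketch that walks two directory trees and concatenates one unified diff per file, treating a file missing on one side as empty (the behaviour of -N). It is a simplification that handles text files only, with no special care for binary files or symlinks; the function names and the sample trees are assumptions made up for the example:

```python
import difflib
import os
import tempfile


def relative_files(root):
    """Return the set of file paths below root, relative to root."""
    found = set()
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            found.add(os.path.relpath(os.path.join(directory, name), root))
    return found


def read_lines(path):
    """Return the lines of a text file, or an empty list if it does not exist."""
    if not os.path.isfile(path):
        return []
    with open(path, encoding="utf-8") as handle:
        return handle.readlines()


def tree_diff(old_root, new_root):
    """Build one unified diff covering every text file of two directory trees."""
    chunks = []
    for rel in sorted(relative_files(old_root) | relative_files(new_root)):
        old_path = os.path.join(old_root, rel)
        new_path = os.path.join(new_root, rel)
        chunks.extend(
            difflib.unified_diff(
                read_lines(old_path), read_lines(new_path),
                fromfile=old_path, tofile=new_path,
            )
        )
    return "".join(chunks)


# Hypothetical trees, created only for this demonstration
with tempfile.TemporaryDirectory() as root:
    old_root = os.path.join(root, "original")
    new_root = os.path.join(root, "modified")
    os.makedirs(old_root)
    os.makedirs(new_root)
    for path, text in (
        (os.path.join(old_root, "a.txt"), "one\n"),
        (os.path.join(new_root, "a.txt"), "two\n"),
        (os.path.join(new_root, "b.txt"), "added\n"),
    ):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    patch_text = tree_diff(old_root, new_root)

print(patch_text, end="")
```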

Let's try the patch: we need the original project to apply it to. In order to do so, we have to remove the directory tree containing the files we just developed ("python-fooproject-myfancy") and recreate it with the original files:

rm -rf python-fooproject-myfancy
cp -dpR python-fooproject-release-0.0.1 python-fooproject-myfancy

We are finally ready to apply the patch: let's change directory

cd python-fooproject-myfancy

and apply the patch using the patch command:

patch -p1 < ../patches/fancymod.patch

the output is as follows:

patching file src/bin/fooapp.py
patching file src/carcano/foolist/foolistitem.py
patching file src/carcano/foolist/foolist.py
patching file src/test/test_foolist.py

as you can see, it prints each of the files that get successfully patched.
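
The -p1 option deserves a word: the file names recorded in the patch start with the name of the top-level directory, and -p1 tells patch to strip that first component so that the remaining path is relative to the directory we are in. A simplified Python sketch of this stripping (the function name "strip_components" is made up, and corner cases such as repeated slashes are ignored):

```python
def strip_components(path, count):
    """Drop the first `count` slash-separated components, like patch -p<count>."""
    parts = path.split("/")
    return "/".join(parts[count:])


# File name as it appears in the patch header for the modified tree
header_path = "python-fooproject-myfancy/src/bin/fooapp.py"

unstripped = strip_components(header_path, 0)   # what -p0 would look for
stripped = strip_components(header_path, 1)     # what -p1 looks for

print(unstripped)
print(stripped)
```

The same result is obtained from the name of the original tree, since the two names differ only in the leading component.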

Lastly we can try "fooapp" to see how it works after applying the patch:

cd src/bin
./fooapp.py

the output is as follows:

Print the list:
Name: RedHat, Version: 8.0, Enabled: True
Name: Suse, Version: 11.0, Enabled: True
Name: CentOS, Version: 8.0, Enabled: False
Name: CentOS, Version: 7.0, Enabled: False
Print the list in ascending order by name:
Name: CentOS, Version: 8.0, Enabled: False
Name: CentOS, Version: 7.0, Enabled: False
Name: RedHat, Version: 8.0, Enabled: True
Name: Suse, Version: 11.0, Enabled: True
Print the list - after the removal of CentOS 8.0:
Name: RedHat, Version: 8.0, Enabled: True
Name: Suse, Version: 11.0, Enabled: True
Name: CentOS, Version: 7.0, Enabled: False
Name: Rocky Linux, Version: 8.0, Enabled: False
Print the list in descending order by name:
Name: Suse, Version: 11.0, Enabled: True
Name: Rocky Linux, Version: 8.0, Enabled: False
Name: RedHat, Version: 8.0, Enabled: True
Name: CentOS, Version: 7.0, Enabled: False

As you can see, the patch worked, and we now have a more detailed output that also takes into account the version of the Linux distribution.

Footnotes

Here ends this tutorial on verifying checksums and comparing different file versions: I hope you enjoyed it and that it helped you better understand this topic.

