Ansible inventory best practices: caveats and pitfalls

Ansible is an extremely powerful data center automation tool: most of its power comes from not being too strict about imposing a structure. This lets you use it in extremely complex scenarios as well as set it up very quickly in trivial ones.

But this is a double-edged sword: too many times I have seen adoption POCs performed with overly poor requirements, by people thinking they could reuse what they experimented with as a baseline for structuring Ansible. This is a very harmful error that quickly leads to unmaintainable real-life environments with duplicated code and settings, often stored in structures without consistent logic or naming, thus losing most of the benefits of such a great automation tool.

Ansible inventory best practices: caveats and pitfalls is the post where we begin exploring how to properly structure Ansible to get all of its power without compromises, organizing things in an easy and straightforward way suitable for almost every operating scenario.

Gather The Requirements

This post develops as if you were the one investigating how to properly structure the Ansible inventory: this is an additional flavour I absolutely wanted to add, since it also provides a good example of the correct approach to use when doing a product analysis and Proof Of Concepts. That's the reason for this "Gather The Requirements" paragraph, so I hope you like this additional effort too.

As solutions architects must know, the very first thing to do when evaluating a product is not to start using it or reading trivial tutorials: that is the perfect recipe for delivering a mess.
The very first thing to do is define the requirements.

Use case Scenarios

The only way to define requirements is having a clear vision of the use cases.

Let's start from answering the question "What":

What

Configuring infrastructural services: this spans from configuring the OS environment of Virtual Machines and bare metal to configuring services such as Database Engines, Application Servers, Web Servers, Load Balancers and even Networking Devices or Appliances
Delivering services: mind that a service is the sum of single configuration items such as Database Instances, Virtual Hosts, Application Server instances and Git repositories, but it also has infrastructural requirements, such as firewall exceptions, that must be addressed before its delivery. In addition to that, an automation tool must also implement rolling releases with strategies such as canary releases or blue-green.

We must also answer “Who”, that is, the actors: this is very important for adding all the necessary security measures to the design.

Who

In this fictional example, users are from the below teams:

Networking Team (operates only networking devices)
IT Operations Team (operates the Virtualization Environment, the OS environment, Application Servers and Load Balancers)
Database Team (operates only databases)
Security Team (manages the PKI, Identity Services, Proxies and Web Application Firewalls)
Services Team (specialists on each service: delivering, monitoring, troubleshooting, ...)

The next question is “When”.

When

Some of them operate during working days, others 24x7: since the solution we are delivering (Ansible) must work for all of them, the answer to this question, which actually is the availability level, is “24x7”.

The last question for defining requirements is “Why”. The question "How" is answered by the actual solution, so it is not part of the requirements.

Why

Because we want to improve the time-to-market when delivering services, using a solution that is straightforward enough to limit human errors to the bare minimum and promote a quick onboarding of new staff.

Underestimating the use case, as well as not elaborating the structure enough "because anyway we don't need this now", leads to a poorly engineered solution: the outcome after some time is always rework, structures and naming with a lot of nested exceptions that are hard to remember and manage, duplicated code and settings, ... in a word: a mess. Sadly, there are people who are not good at proper engineering and who accuse the good ones of over-engineering, and they are often supported by management because they give the impression of being quick and effective. The outcome can be seen by all - just as an example: "Study finds 268% higher failure rates for Agile software projects" (I don't think the problem is Agile itself).

Now that we have the requirements, we can start investigating the tool's features to see if they are enough and how to use them to match the requirements.

Ansible Configurations Sources - A Quick Walkthrough

Since the Inventory is not the only configuration source, in order to learn how to structure it we must of course know which the others are, how they work and how they relate to the Inventory: only this way can we draw up an optimal structure that does not overlap or, even worse, clash with the other configuration sources.

The Ansible Inventory

The Ansible Inventory is a configuration source capable of providing both topological settings and playbook settings: it provides them as a document, most of the time formatted as INI or YAML. The actual source can be a single file (often called "hosts" or "hosts.yml") as well as the result of concatenating multiple files within a directory. It is even possible to use an application or script as the inventory source: this enables generating inventories on the fly based on subsets of the CMDB that make sense for each specific run, a feature exploited for example by Katello (the Red Hat Network Satellite Server's upstream project).

The inventory contains:

Target Hosts
Hostgroups
Variables

Target Hosts

Any host listed in the inventory is called a target host.

An inventory can be as simple as follows:

localhost
pgsql-ca-up1a001
git-ca-up1a001

In this very trivial inventory we declare 2 target hosts ("pgsql-ca-up1a001" and "git-ca-up1a001") along with the "localhost" target.

In real life things are of course much more complex, and Inventories also define groups of target hosts.

Hostgroups

Entries in the inventory enclosed in square brackets (stanza definitions, in INI format terms) are used to define groups of target hosts: they are called hostgroups.

Mind that if a hostgroup's members are themselves hostgroups, it is necessary to add the ":children" suffix to the hostgroup's name - we will see this in action soon.
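
As a preview of nested hostgroups, here is a minimal sketch of the same idea in the YAML inventory format, where nesting is expressed with the "children" key instead of the ":children" suffix. It reuses host names from this post; the "db_ca_up" parent hostgroup and the "hosts.yml" file name are assumptions made only for illustration.

```yaml
# Sketch of a YAML-format inventory (assumed file name: hosts.yml)
all:
  hosts:
    localhost:
  children:
    # hypothetical parent hostgroup: its members are hostgroups, not hosts
    db_ca_up:
      children:
        pgsql_ca_up1:
          hosts:
            pgsql-ca-up1a001:
            pgsql-ca-up1b002:
```

Targeting "db_ca_up" reaches every host of every child hostgroup, which is exactly what the ":children" suffix achieves in the INI format.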

The purpose of hostgroups is twofold:

they are used to provide targets while running Ansible, avoiding the need to list hosts one by one
they are used to bind variables: binding variables to a hostgroup binds them to every target host that is a direct or indirect member of that hostgroup, sparing you from having to list them manually for each single host - we will discuss this shortly

Figuring out a good strategy for defining hostgroups looks trivial, but it isn't.

Hostgroups defined in the inventory are of course static, but mind that Ansible also provides the "add_host" and "group_by" modules, which enable the generation of dynamic hostgroups on the fly. We will see how useful and handy they are soon.
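
To give an idea of how dynamic hostgroups can be generated, here is a minimal sketch of a playbook using both modules. The "os_" prefix, the "pgsql-ca-up1d004" host and the "pgsql_dynamic" hostgroup are hypothetical names chosen for illustration; hostgroups created this way exist only for the duration of the run.

```yaml
# Sketch only: the hostgroups generated here exist just for the current run
- name: Build dynamic hostgroups from gathered facts
  hosts: all
  gather_facts: true
  tasks:
    - name: Group hosts by OS family (creates e.g. "os_redhat", "os_debian")
      ansible.builtin.group_by:
        key: "os_{{ ansible_facts['os_family'] | lower }}"

    - name: Add a host that is not in the static inventory to a hostgroup
      ansible.builtin.add_host:
        name: pgsql-ca-up1d004        # hypothetical host name
        groups: pgsql_dynamic         # hypothetical hostgroup name

- name: Use a hostgroup created on the fly
  hosts: os_redhat
  gather_facts: false
  tasks:
    - name: Show which hosts ended up in the dynamic hostgroup
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} belongs to os_redhat"
```

The second play targets a hostgroup that does not exist in the inventory file at all: it is matched only if the first play placed at least one host in it.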

Host_vars

Variables provided by the inventory and bound to a single host are called host_vars: the most common and easy way of declaring them is listing them in a file with the same name as the target host the variables are bound to, storing that file beneath the "host_vars" directory within the inventory's directory tree.

The following snippet for example defines the "ansible_connection" host_var with the value "local":

ansible_connection: local

It must be put in the "environment/host_vars/localhost.yml" file so as to have it assigned only to the "localhost" target host. With this variable set like so, Ansible does not use any kind of connection while running plays that have "localhost" as target: it just runs the tasks locally.

Group_vars

To avoid repetitions that are unnecessary and cumbersome to maintain, it is possible to list variables common to every host that is a member (directly or indirectly) of a specific hostgroup by putting them in a file with the name of that hostgroup, stored beneath the "group_vars" directory: these variables are indeed commonly called group_vars.

Mind the special built-in group "all": its purpose is to define variables for every target host in the inventory.

For example, we can assign a variable with the name of the current environment to every host in the inventory by adding the following contents to the "environment/group_vars/all.yml" file:

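# "environment" is a reserved Ansible keyword: a different name (here "env_name") avoids the reserved-name warning
env_name: lab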

Be wary of precedence: variables with the same name declared "closer" to the host in a hostgroup hierarchy take precedence over the outer ones!
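
The precedence rule can be illustrated with a minimal sketch in the YAML inventory format, where the same variable is declared at three levels of the hierarchy. "pg_max_connections" is a hypothetical variable name used only for this illustration.

```yaml
# Sketch: the same variable declared at three levels of a hostgroup hierarchy
all:
  vars:
    pg_max_connections: 100          # outermost: applies to every host
  children:
    pgsql_ca_up1:
      vars:
        pg_max_connections: 200      # closer: overrides the "all" value
      hosts:
        pgsql-ca-up1a001:
          pg_max_connections: 50     # host level: wins for this host only
        pgsql-ca-up1b002:            # no override here: inherits 200
```

With this layout "pgsql-ca-up1a001" resolves the variable to 50 while "pgsql-ca-up1b002" resolves it to 200. The same rule applies when the values are stored in "group_vars" and "host_vars" files instead of inline.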

Listing The Ansible Inventory's Contents

If necessary, it is possible to inspect the contents of an Inventory processed by Ansible by running:

ansible-inventory --list -i /ansible/environment

or, if you prefer to display the generated graph:

ansible-inventory --graph --vars -i /ansible/environment

Vars Files

The second most used configuration source is vars_files: these files can be used to provide variables that are specific to each run, for example re-defining the settings of a service managed by Ansible for performance tuning. They are YAML formatted files containing variables that are loaded during the play.

It is possible to encrypt single vars_files using the "ansible-vault" command line utility: this method is used for example to create vars_files containing only sensitive variables (some refer to this kind of file as "secrets files").
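
As a minimal sketch of how vars_files are loaded, the following play reads one plain file and one file assumed to be encrypted with "ansible-vault". Both file names and the "pg_shared_buffers" variable are hypothetical; "pgsql_ca_up1" is a hostgroup defined later in this post.

```yaml
# Sketch: file names and variable names below are hypothetical
- name: Apply run-specific PostgreSQL tuning
  hosts: pgsql_ca_up1
  vars_files:
    - vars/postgresql_tuning.yml     # plain YAML variables for this run
    - vars/postgresql_secrets.yml    # assumed to be encrypted with ansible-vault
  tasks:
    - name: Show a variable loaded from the vars_files
      ansible.builtin.debug:
        msg: "shared_buffers is {{ pg_shared_buffers | default('not set') }}"
```

When one of the files is encrypted, the vault password must be supplied at run time, for example with the "--ask-vault-pass" option of "ansible-playbook".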

Other Configuration Sources

Other ways of providing configuration to Ansible are:

command line variables: these variables are set on the command line when running the "ansible-playbook" statement
lookups: this method is used to get values, such as secrets, from third party external services
environment variables: these are variables fetched from the shell environment using lookups
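
A minimal sketch of two of these sources working together follows: a value passed on the command line and a value read from the shell environment through the "env" lookup. "target_release" and "operator_home" are hypothetical variable names.

```yaml
# Sketch: "target_release" and "operator_home" are hypothetical names
- name: Show values coming from the command line and the shell environment
  hosts: localhost
  gather_facts: false
  vars:
    # read an environment variable of the shell that launched ansible-playbook
    operator_home: "{{ lookup('ansible.builtin.env', 'HOME') }}"
  tasks:
    - name: Print both values
      ansible.builtin.debug:
        msg: >-
          release={{ target_release | default('not set') }}
          home={{ operator_home }}
```

The command line variable would be supplied with the "-e" option, for example "ansible-playbook playbook.yml -e target_release=1.2.3", where "playbook.yml" is an assumed file name.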

Facts

Ansible also has some special variables that are called "facts". These special variables can be:

host facts: they are gathered when connecting to the host, or fetched from a cache containing the previously discovered ones
runtime facts: they are defined or redefined on the fly somewhere during the play (the same way you define and redefine variables when working with any programming language)
local facts: they are loaded from JSON or INI formatted files stored beneath the target host's "/etc/ansible/facts.d" directory

Both host facts and local facts are gathered by the "setup" Ansible module: if they are altered during a play, for example by changing the contents of the "/etc/ansible/facts.d" directory or by starting a service, they can be refreshed by re-running the "setup" Ansible module.
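
The three kinds of facts can be seen together in a minimal sketch like the following, which defines a runtime fact from a host fact and then re-runs "setup" to refresh only the local facts. "is_redhat_family" is a hypothetical fact name; "pgsql_ca_up1" is a hostgroup defined later in this post.

```yaml
- name: Work with facts during a play
  hosts: pgsql_ca_up1
  gather_facts: true                 # host facts are gathered by "setup" here
  tasks:
    - name: Define a runtime fact from a host fact
      ansible.builtin.set_fact:
        is_redhat_family: "{{ ansible_facts['os_family'] == 'RedHat' }}"

    - name: Re-run "setup" to refresh only the local facts
      ansible.builtin.setup:
        filter: ansible_local

    - name: Show the runtime fact and the refreshed local facts
      ansible.builtin.debug:
        msg:
          is_redhat_family: "{{ is_redhat_family }}"
          local_facts: "{{ ansible_local | default({}) }}"
```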

Ansible Configuration Objects

Now that we know the configuration sources, we can focus on the configuration objects and formats supported by Ansible. These are:

strings - mind that unquoted values made only of digits, or of boolean literals, are automatically mapped to numbers or booleans
dictionaries
lists
lists of dictionaries

These objects are most of the time provided in YAML format, although JSON and INI formats are also supported if necessary.
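
The following minimal sketch shows each of these objects in YAML, including the automatic mapping of unquoted values. Every variable name in it is hypothetical.

```yaml
# Sketch: every variable name here is hypothetical
pg_version_number: 14          # unquoted digits: parsed as a number
pg_version_string: "14"        # quoted: stays a string
pg_ssl_enabled: true           # unquoted boolean literal: parsed as a boolean
pg_ssl_label: "true"           # quoted: stays a string

# dictionary
pg_tuning:
  shared_buffers: 2GB
  max_connections: 50

# list
pg_extensions:
  - pg_stat_statements
  - pgcrypto

# list of dictionaries
pg_databases:
  - name: app1
    owner: app1_owner
  - name: app2
    owner: app2_owner
```

Quoting is the simplest way to keep a value as a string when the consumer expects one.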

Ansible requires strong YAML skills - the YAML format is not as trivial as it may look at first glance: my heartfelt advice is to read the "YAML in a Nutshell" post, since it provides everything you must know to work properly with YAML. It wouldn't hurt to also read "JSON and jq in a nutshell".

The Solution

We are ready to draw up a solution that must fit every requirement we defined so far: instead of just describing the solution, in this paragraph I'm providing a fully functional example, describing the rationale behind each specific choice.

A good use case to hit every requirement is delivering PostgreSQL on target hosts. This use case indeed requires:

to be able to specify the PostgreSQL version to install
to be able to provide host specific performance tweaks
to be able to provide cluster specific topological settings, such as rules for the system firewall
to be able to create databases specific to single deliverables

As per best practices, instead of writing a full playbook with all the necessary tasks on our own, we first have a look at the online Ansible Galaxy to see if a well supported Ansible role is already available.

A quick check shows that the "galaxyproject.postgresql" Ansible role exists: since it looks like an official (and so well maintained) one, we can just use this off-the-shelf Ansible role, including it in the playbook we are writing.

Let's install the "galaxyproject.postgresql" Ansible role using the "ansible-galaxy" command line utility as follows:

ansible-galaxy role install -p /ansible/roles galaxyproject.postgresql

The above statement runs "ansible-galaxy", telling it to install the downloaded role in the "/ansible/roles" directory within the container ("-p" command line parameter).

After installing it, have a look at the "ansible/roles/galaxyproject.postgresql/README.md" file to see the variables that can be passed to the role and that must therefore be put into the configuration structure we are about to describe and implement.
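
For reference, including the role in a playbook can be as small as the following sketch. It assumes that Ansible is configured to look for roles in "/ansible/roles" (for example through the "roles_path" setting) and it targets the "pgsql_ca_up1" hostgroup defined later in this post.

```yaml
# Sketch: assumes roles are looked up in /ansible/roles (e.g. via roles_path)
- name: Deliver PostgreSQL
  hosts: pgsql_ca_up1
  become: true
  roles:
    - role: galaxyproject.postgresql
```

The play itself carries no settings: everything the role needs is expected to come from the configuration sources described above.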

Initial Bare-minimum Configuration

We must of course start with the bare minimum configuration - first we must create the inventory's directory:

mkdir -m 755 ansible/environment

Once done, we configure the bare minimum "host_vars" - create the "host_vars" sub-directory:

mkdir -m 755 ansible/environment/host_vars

and define the "ansible_connection" host_var only for the "localhost" target host - just create the "ansible/environment/host_vars/localhost.yml" file with the following contents:

ansible_connection: local

As we said, this is a special variable used to tell Ansible not to use any kind of connection while running plays that have "localhost" as target, but to just run the tasks.

Addressing IT Operations And Networking Team's Needs

Let's start by seeing how to address the common needs of the IT Operations team and the Networking team.

Infrastructure's Slice Targets

The first need is to have meaningful targets that enable running statements on a large scale.

From the operational perspective, an infrastructure can typically be sliced in different purpose-specific ways:

target hosts belonging to the same availability zone
target hosts belonging to the same datacenter
target hosts belonging to the same cluster

All of the above can be further grouped by OS family.

Avoid being implicit! One thing you may be tempted to do is avoid using environment placeholders in target and topological variable names. The *apparent* benefit is that target names, such as hostgroups and topological variables, are the same across the environments, which enables running exactly the same ansible-playbook statements and apparently eases promotions across the environments (every variable name looks the same). This is a very bad anti-pattern that will prevent you from running cross-environment automations. Of course these are quite rare use cases, but they happen: think for example of rolling a service's database over from production to test, back-porting the database (of course anonymizing sensitive data).

When doing operational tasks, it is relatively common to perform operations on hosts belonging to these slices or to subsets generated by their intersections.

The best practice is, besides defining hostgroups for every cluster (or you won't be able to define group_vars), to also provide hostgroups for each of the above kinds of slices: this enables you to be extremely quick when performing time-sensitive operations such as shutting down every host of an availability zone during an incident, quickly cordoning a datacenter, and so on.

As an example, create the "ansible/environment/hosts" inventory file with the following contents:

localhost

# security tier: 1, environment: prod, os: unix/linux, svc: load-balancers, cluster: 0
[lb_ca_up1]
lb-ca-up1a001
lb-ca-up1b002

# security tier: 1, environment: prod, os: unix/linux, svc: git, cluster: 0
[git_ca_up1]
git-ca-up1a001
git-ca-up1b002

# security tier: 1, environment: prod, os: unix/linux, svc: postgresql, cluster: 0
[pgsql_ca_up1]
pgsql-ca-up1a001
pgsql-ca-up1b002
pgsql-ca-up1c003

# security tier: 1, environment: prod, os: unix/linux, availability-zone: a
[ca_up1a]
pgsql-ca-up1a001
git-ca-up1a001
lb-ca-up1a001

# security tier: 1, environment: prod, os: unix/linux, availability-zone: b
[ca_up1b]
pgsql-ca-up1b002
git-ca-up1b002
lb-ca-up1b002

# security tier: 1, environment: prod, os: unix/linux, availability-zone: c
[ca_up1c]
pgsql-ca-up1c003

# os: unix/linux, environment: prod, availability-zone: a
[ca_upNa:children]
ca_up1a

# os: unix/linux, environment: prod, availability-zone: b
[ca_upNb:children]
ca_up1b

# os: unix/linux, environment: prod, availability-zone: c
[ca_upNc:children]
ca_up1c

# os: unix/linux
[ca_up:children]
ca_upNa
ca_upNb
ca_upNc

We can limit the actual targets by specifying the "--limit" clause ("-l" for short), using pattern matching and boolean operators.

To test it and so verify the actual targets we can run the "ansible" statement with the "--list-hosts" option.

For example, to limit the targets to every unix machine in the "ca" datacenter not belonging to the availability zone "a":

ansible all -i /ansible/environment/hosts --list-hosts -l 'ca_up:!ca_upNa'

the output is as follows:

  hosts (4):
    lb-ca-up1b002
    git-ca-up1b002
    pgsql-ca-up1b002
    pgsql-ca-up1c003

You can learn more about Ansible's pattern matching in the "Patterns and ad-hoc commands" documentation page.
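
The same patterns can also be used in the "hosts" key of a play, as in the following minimal sketch based on the hostgroups of the inventory above.

```yaml
- name: Every unix host in the "ca" datacenter outside availability zone "a"
  hosts: "ca_up:!ca_upNa"               # "!" excludes a hostgroup
  gather_facts: false
  tasks:
    - name: Check connectivity
      ansible.builtin.ping:

- name: Only the PostgreSQL cluster members living in availability zone "b"
  hosts: "pgsql_ca_up1:&ca_up1b"        # "&" is the intersection operator
  gather_facts: false
  tasks:
    - name: Check connectivity
      ansible.builtin.ping:
```

With the inventory above, the second play resolves to the single host "pgsql-ca-up1b002", the only one that is a member of both hostgroups.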

Target Specific Settings

A very common practice - not only in Ansible - is to assign labels to objects. I strongly recommend doing this because it provides several benefits, including implementing safety measures to prevent playbooks from running tasks on the wrong hosts by mistake.

For example, it is possible to create the "host_labels" list for the "pgsql-ca-up1a001" host and assign the "postgresql" label - just create the "ansible/environment/host_vars/pgsql-ca-up1a001.yml" file with the following contents:

host_labels:
  - postgresql

Repeat this step for every PostgreSQL server, such as "pgsql-ca-up1b002" and "pgsql-ca-up1c003".

To exploit it, it is enough to add a "when" condition in the PostgreSQL related playbook to run tasks only when the "host_labels" list contains the "postgresql" label.
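
A minimal sketch of such a guard follows. The "default([])" filter is an addition of mine, so that hosts without any label simply skip the task instead of failing on an undefined variable.

```yaml
- name: PostgreSQL tasks guarded by host labels
  hosts: pgsql_ca_up1
  gather_facts: false
  tasks:
    - name: Run only on hosts labelled "postgresql"
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} is labelled as a PostgreSQL server"
      # default([]) keeps the condition false on hosts with no labels at all
      when: "'postgresql' in (host_labels | default([]))"
```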

Both the IT Operations team and the Networking team need to deliver settings typically related to the infrastructure's topology, such as:

network devices settings
operating system settings (sysctl tweaks, system-firewall rules, corporate-wide PKI's trust-stores, ...)
service-specific settings, such as performance tweaks, cluster-level settings and so on

These settings are then merged with templates to generate the actual managed configuration files.

Settings of this kind are perfect candidates for being stored within the inventory as host_vars or group_vars.

Service instance specific settings, such as virtual hosts, database instances, dedicated certificates and the like, must instead be provided using vars_files. Putting them in the inventory not only makes it grow, but is also a poor and ineffective way of designing the delivery of your services: scattering settings across the inventory means you have to spend time gathering them back each time you need to make sense of the overall settings of a delivery before making a change. It is far better to have all the settings of a deliverable grouped together so you can see them at a glance: we'll see soon how this can be easily achieved by using vars_files.

To see it in action, create the "group_vars" sub-directory as follows:

mkdir -m 755 ansible/environment/group_vars

In our Lab, as an example, we install PostgreSQL 14: a quick look at the "ansible/roles/galaxyproject.postgresql/README.md" file shows us that the PostgreSQL version to install is set by the "postgresql_version" variable, so we set the "postgresql_version" group_var to "14" for the "pgsql_ca_up1" hostgroup so as to have it inherited by every host belonging to it. Just create the "ansible/environment/group_vars/pgsql_ca_up1.yml" file with the following contents:

postgresql_version: 14

We can of course also add other settings, such as the system firewall rules:

firewall:
  - # from subnet apps_p1_a_s0 to service pgsql
    # ticket: NET-54271
    rule: apps_p1_a_s0_to_pgsql
    src_ip: 192.168.254.0/24
    service: postgresql
    action: accept
    state: enabled

In this example we are granting access to the PostgreSQL service from the whole 192.168.254.0/24 subnet.

Note how, as per best practices, we are also adding tracking information about the ticket that was implemented (NET-54271): this way we keep everything clear and easy to trace in the event of an audit request or when investigating security incidents.
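
How the "firewall" list is consumed depends on the playbook or role in use. As a purely illustrative sketch, and assuming the targets run firewalld and the "ansible.posix" collection is installed, each entry could be rendered as a firewalld rich rule like this:

```yaml
# Sketch: assumes firewalld on the targets and the ansible.posix collection
- name: Apply the firewall rules declared in the inventory
  hosts: pgsql_ca_up1
  become: true
  tasks:
    - name: Render each "firewall" entry as a firewalld rich rule
      ansible.posix.firewalld:
        rich_rule: >-
          rule family="ipv4"
          source address="{{ item.src_ip }}"
          service name="{{ item.service }}"
          {{ item.action }}
        permanent: true
        immediate: true
        state: "{{ item.state }}"
      loop: "{{ firewall | default([]) }}"
      loop_control:
        label: "{{ item.rule }}"     # keep the output readable and traceable
```

Using the "rule" name as the loop label makes every line of the run output traceable back to the entry, and so to the ticket documented next to it.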

Addressing Database Administrators Team's Needs

Database Administrators very often do not really get along with automation tools such as Ansible (I'm not saying all of them): in my personal experience most of them like to operate the old way, directly on their systems.

A way to let them operate as they like while still having automation is agreeing that, instead of directly editing the database engine's configuration file, they modify a file that Ansible uses to generate the actual configuration file used by the database engine, and then just run the playbook to apply the changes.

This can be achieved by setting local facts directly on the database hosts.

Local Facts

Local facts are files (JSON formatted, in our case) containing settings that Ansible can read directly from the target host. It is possible to structure these JSON files so that they are compatible with the Ansible roles, so that the only thing the Database Administrators have to do is edit these JSON files and ask someone from the IT Operations team to run Ansible on their behalf.

This approach is often a win-win:

the central inventory does not grow with these settings
DBAs do not need to learn how to write Ansible roles and playbooks: they just need to learn the syntax of these JSON files

As an example, we can create some local facts on the "pgsql-ca-up1a001" host: connect to the host and create the "/etc/ansible/facts.d" directory:

sudo mkdir -m 755 /etc/ansible /etc/ansible/facts.d

Now create the "/etc/ansible/facts.d/postgresql.fact" facts file with the following contents:

{
  "conf": [
    { "listen_addresses": "'*'" },
    { "max_connections": 50 }
  ],
  "pg_hba_conf": [
    "host all all all md5"
  ]
}

We can check the outcome by running:

ansible pgsql-ca-up1a001 -m ansible.builtin.setup -a "filter=ansible_local"
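
Once the local facts are in place, a play can hand them over to the role. The following is a minimal sketch under the assumption that the role reads its settings from the "postgresql_conf" and "postgresql_pg_hba_conf" variables: check the role's README.md for the names actually supported by the installed version.

```yaml
# Sketch: assumes the role reads "postgresql_conf" and "postgresql_pg_hba_conf"
- name: Deliver PostgreSQL using settings maintained by the DBAs
  hosts: pgsql_ca_up1
  become: true
  gather_facts: true          # needed: local facts are collected by "setup"
  vars:
    # /etc/ansible/facts.d/postgresql.fact is exposed as ansible_local.postgresql
    postgresql_conf: "{{ ansible_local.postgresql.conf | default([]) }}"
    postgresql_pg_hba_conf: "{{ ansible_local.postgresql.pg_hba_conf | default([]) }}"
  roles:
    - role: galaxyproject.postgresql
```

The "default([])" filters let the play run also on hosts where the DBAs have not created the facts file yet.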

Playbooks

Playbooks are a complex topic that deserves a thorough explanation in a dedicated post, again talking about best practices, caveats and pitfalls: there's no room for a bottom-up approach without a well defined standard, especially regarding naming and structure, unless you like working in a mess (I have seen a lot of mess around, sadly).

Playbooks can be classified by purpose. There are:

delivery targeted playbooks - these are playbooks aimed at delivering configurations (such as firewall rules) or configuration items (such as database schemas) to existing services
deploy targeted playbooks - these are playbooks aimed at deploying services

These playbooks can be further classified as follows:

- playbooks aimed at deploying a service without dependencies
- playbooks aimed at deploying a service with dependencies - for example an application that requires a database schema. In this case it is very convenient developing them so that they are configured using a blueprint: this eases configuration management, since everything is in the same place and the configuration structure itself enables an easy understanding of the overall service details. In addition to that, it makes it very easy to deploy other service instances, by simply creating a copy of the blueprint and modifying the settings as necessary.

The "Ansible playbooks best practices: caveats and pitfalls" post goes through all of this in detail.

Footnotes

Structuring the Ansible inventory the proper way is the first step towards working proficiently with Ansible - in this post we saw an example that addresses a fairly complex real-life use case. But as usual things must always be tailored to the specific needs, so always use your own brain - take enough time to gather requirements, challenge them, define standards and then do a proper design.

I have seen a lot of semi-official and even official and renowned frameworks that are very specific about tracking the progress of development (“more governance for everyone!” - TM) while completely forgetting the design process (they just pass from gathering user stories directly to development): use that approach with Ansible, and you will soon realize how fragile and risky blindly following them is.
