Overlay networking makes it possible to create tunnels that interconnect networks defined inside a host (such as Docker/Podman private networks): for example, flannel-based Kubernetes uses VxLANs to interconnect the nodes' private networks. VxLAN, however, is only one of the available technologies: others such as GENEVE, STT or NVGRE are available too.
In this post we set up a GENEVE tunnel with OpenVSwitch and Podman - the described setup goes beyond the simple interconnection of layer 3 network segments, interconnecting two Podman private networks configured with the same IP subnet (so they share the same broadcast domain) - the layer 2 data is exchanged between the OpenVSwitch bridges on the two hosts through the GENEVE tunnel.
Overlay Networks
Overlay networking is a technology that interconnects network segments by encapsulating their traffic (from layer 2 up the network stack) into layer 4 packets of an existing TCP/IP network (the underlay network). This means that the two network segments interconnected by the overlay technology share the same broadcast domain.
The most broadly used overlay network technologies are:
VxLAN
Mostly sponsored by Cisco, VMware, Citrix, Red Hat, Arista and Broadcom, it relies on the UDP protocol, using port 4789. It lets you define up to 16 million virtual networks, each identified by its own VNID (Virtual Network Identifier). A VxLAN distributed vSwitch is one big switch spread across several switches: all of the switches that compose the distributed vSwitch are connected through VxLAN tunnels linking their Virtual Tunnel EndPoints (VTEPs) to each other, which is achieved by subscribing to the same IGMP multicast group. It is worth noting that it does not require the Spanning Tree Protocol (STP), since it implements loop prevention by itself.
Mind that the switching tables of the switches that are part of the distributed vSwitch are not shared: each distributed switch element has its own - that is, there is no control plane.
A switching table entry looks like:
MAC=VNID:VTEP:VPort
A VxLAN switch behaves exactly like a traditional switch: when a packet with an unknown source MAC address is received, it stores that address as belonging to the port it came from. When a packet with an unknown destination MAC address is received, it floods it out of every port except the source port.
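Just to make the VTEP concept more tangible, here is a minimal sketch of how a multicast-based VTEP can be created on a plain Linux host with iproute2 - the interface name (eth0), the VNID and the multicast group are only example values, and this interface is not needed for the lab built later in this post:
# create a VTEP for VNID 42, joining multicast group 239.1.1.1 on eth0, using the IANA port
ip link add vxlan42 type vxlan id 42 group 239.1.1.1 dev eth0 dstport 4789
ip link set vxlan42 up
# the learned MAC-to-VTEP entries can then be inspected with
bridge fdb show dev vxlan42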
NVGRE
Network Virtualization using Generic Routing Encapsulation is mostly sponsored by Microsoft, Arista Networks, Intel, Dell, Hewlett-Packard, Broadcom and Emulex. It relies on the GRE protocol. In the same way as VxLAN, it allows up to 16 million virtual networks, each identified by its own TNI (Tenant Network Identifier) and carried in its own GRE tunnel.
Unlike VxLAN, NVGRE supports MTU discovery to dynamically reduce the packet size of intra-virtual-network traffic.
In addition, unlike VxLAN, NVGRE does not rely on flood-and-learn behavior over IP multicast, which makes it more scalable when handling broadcasts; this is a double-edged sword, though, since it makes it hardware/vendor dependent.
The main disadvantage of NVGRE over VxLAN is probably that, in order to provide flow-level granularity (needed to take advantage of all the available bandwidth), the transport network (for example the router) should look up the Key ID in the GRE header: this is why, in order to enhance load balancing, the draft suggests using multiple IP addresses per NVGRE host, so that more flows can be load balanced. This is difficult to implement in practice.
STT
Stateless Transport Tunneling has been backed by VMware: it offers good performance - but since we are talking about Linux, discussing this technology further would be off topic.
VxLAN is probably the most broadly used protocol: besides Linux and VMware, it has been implemented in several routers and switches by Arista, Brocade, Cisco, Cumulus, DELL, HP, Huawei, Juniper, OpenvSwitch and Pica8, and there are several Network Interface Cards (NICs) that implement TCP Segmentation Offload for it, such as Broadcom, Intel (Fulcrum), HPE Emulex (be2net), Mellanox (mlx4_en and mlx5_core) and Qlogic (qlcnic).
Geneve
To address the perceived limitations of VxLAN and NVGRE, VMware, Microsoft, Red Hat and Intel proposed the Generic Network Virtualization Encapsulation (GENEVE): it has been designed taking cues from many mature and long-lived protocols such as BGP, IS-IS and LLDP.
The outcome is an extensible protocol (thanks to the variable-length Options field) that can transport any protocol (thanks to the Protocol Type field), so that it can fit every future need.
Note that parsing the Options field is mandatory: OAM information and Critical flags are indeed stored there. Its header format has been carefully designed to let NICs perform TSO, although so far only a few of them implement it for GENEVE.
GENEVE has its own registered protocol number and uses the IANA-registered port UDP/6081. In the same way as VxLAN, it uses a Virtual Network Identifier (VNI).
It is worth mentioning that Wireshark dissectors are already available.
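Before moving on to OpenVSwitch, it is worth knowing that the Linux kernel can also terminate GENEVE tunnels natively via iproute2 - the following is only a minimal sketch with a hypothetical interface name and VNI (in this post we use OpenVSwitch instead):
# create a GENEVE interface with VNI 100 towards the 192.168.0.11 peer (UDP/6081 is the default port)
ip link add name gnv0 type geneve id 100 remote 192.168.0.11
ip link set gnv0 up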
The following example shows the statements necessary to set up a GENEVE virtual network that spans two different hosts using OpenVSwitch (as-ca-ut1a001 - IP 192.168.0.10 - and as-ca-ut1a002 - IP 192.168.0.11):
On the as-ca-ut1a001 host:
ovs-vsctl add-port ovs_net1 to_as-ca-ut1a002 -- set interface to_as-ca-ut1a002 \
type=geneve options:remote_ip=192.168.0.11
On the as-ca-ut1a002 host:
ovs-vsctl add-port ovs_net1 to_as-ca-ut1a001 -- set interface to_as-ca-ut1a001 \
type=geneve options:remote_ip=192.168.0.10
We will see the above statements in action in the Lab we are about to set up.
Provision The Lab
The following table summarizes the networks we are about to set up:
| Name | Subnet CIDR | Domain | Description |
|------|-------------|--------|-------------|
| Management Network | depends on your setup | mgmt-t1.carcano.local | We use the default network the VMs get attached to as a fictional management network - the actual configuration of this network depends on the hypervisor you are using. |
| Core Network Testing Security Tier 1 | N.A. | N.A. | This is a trunked network used to transport the Testing VLANs: Vagrant will set it up as "192.168.253.0/24", but the VMs will use it only as a network segment to transport VLANs. |
| Application Servers Network Testing Security Tier 1 | 192.168.0.0/24 | as-t1.carcano.local | This network is used for attaching the Application Server VMs of the Testing environment. |
This table summarizes the VMs' homing on the above networks:
| Hostname | Services Subnet/Domain(s) | Management Subnet/Domain | Description |
|----------|---------------------------|--------------------------|-------------|
| as-ca-ut1a001 | as-t1.carcano.local | mgmt-t1.carcano.local | The first Test Security Tier 1 environment's Application Server - in this post we only install Podman on it and set up a Podman private network; we then set up a GENEVE tunnel to link this private network to the Podman private network on the as-ca-ut1a002 host. |
| as-ca-ut1a002 | as-t1.carcano.local | mgmt-t1.carcano.local | The second Test Security Tier 1 environment's Application Server - also here we only install Podman on it and set up a Podman private network; we then set up a GENEVE tunnel to link this private network to the Podman private network on the as-ca-ut1a001 host. |
Deploying Using Vagrant
In order to provision the VMs above, it is necessary to extend the Vagrantfile shown in the previous post by adding the "as-ca-ut1a002" VM to the "host_vms" list of dictionaries. This can be accomplished by adding the following snippet:
{
  :hostname => "as-ca-ut1a002",
  :domain => "netdevs.carcano.local",
  :core_net_temporary_ip => "192.168.253.13",
  :services_net_ip => "192.168.0.11",
  :services_net_mask => "24",
  :services_net_vlan => "100",
  :summary_route => "192.168.0.0/16 192.168.0.254",
  :box => "grimoire/ol92",
  :ram => 2048,
  :cpu => 2,
  :service_class => "ws"
},
finally provision the VMs by simply running:
vagrant up as-ca-ut1a001 as-ca-ut1a002
Update Everything
As best practices suggest, it is always best to provision systems that are as up to date as possible.
SSH connect to the "as-ca-ut1a001" VM as follows:
vagrant ssh as-ca-ut1a001
then switch to the "root" user again:
sudo su -
update the system using DNF:
dnf -y update
reboot the VM:
shutdown -r now
Install The OpenVSwitch (OVS) Kernel Module
Since the Vagrant box provided by Oracle is missing the OpenVSwitch kernel module, we must install it - SSH connect to the "as-ca-ut1a001" VM as follows:
vagrant ssh as-ca-ut1a001
then switch to the "root" user again:
sudo su -
the OpenVSwitch kernel module is provided by two different RPM packages, depending on whether you are using the Unbreakable Enterprise Kernel (UEK) or the Red Hat Compatible Kernel (RHCK), so first we have to check the flavor of the currently running kernel:
uname -r |grep --color 'el[a-z0-9_]*'
if the output string contains "uek", as in the following example, then it is an Unbreakable Enterprise Kernel (UEK):
5.15.0-101.103.2.1.el9uek.x86_64
in this case, install the "kernel-uek-modules" RPM package as follows:
dnf install -y kernel-uek-modules
otherwise, if the output string does not contain "uek", as in the following example, then it is a Red Hat Compatible Kernel (RHCK):
5.14.0-284.30.1.el9_2.x86_64
in this case, install the "kernel-modules" RPM package as follows:
dnf install -y kernel-modules
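If you prefer to script the choice instead of checking by eye, a small sketch based on the same grep test should work for both kernel flavors:
if uname -r | grep -q uek; then
  # Unbreakable Enterprise Kernel
  dnf install -y kernel-uek-modules
else
  # Red Hat Compatible Kernel
  dnf install -y kernel-modules
fi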
Install The Software
OpenVSwitch is not shipped with Oracle Linux, but since Oracle Linux is binary compatible with Red Hat Enterprise Linux - and so is CentOS - we can download the pre-built RPM packages freely shipped by the CentOS project.
SSH connect to the "as-ca-ut1a001" VM as follows:
vagrant ssh as-ca-ut1a001
then switch to the "root" user again:
sudo su -
to be tidy, we create a directory tree where we will download the RPM packages we are about to install:
mkdir -m 755 /opt/rpms /opt/rpms/3rdpart
cd /opt/rpms/3rdpart
let's start by downloading all of the OpenVSwitch RPM packages of our desired version and build:
URL=https://cbs.centos.org/kojifiles/packages
OVSPACKAGE=openvswitch3.1
OVSVERSION=3.1.0
OVSBUILD=65.el9s
ARCH=$(uname -i)
wget ${URL}/${OVSPACKAGE}/${OVSVERSION}/${OVSBUILD}/${ARCH}/${OVSPACKAGE}-${OVSVERSION}-${OVSBUILD}.${ARCH}.rpm
wget ${URL}/${OVSPACKAGE}/${OVSVERSION}/${OVSBUILD}/${ARCH}/${OVSPACKAGE}-devel-${OVSVERSION}-${OVSBUILD}.${ARCH}.rpm
wget ${URL}/${OVSPACKAGE}/${OVSVERSION}/${OVSBUILD}/${ARCH}/${OVSPACKAGE}-ipsec-${OVSVERSION}-${OVSBUILD}.${ARCH}.rpm
wget ${URL}/${OVSPACKAGE}/${OVSVERSION}/${OVSBUILD}/${ARCH}/python3-${OVSPACKAGE}-${OVSVERSION}-${OVSBUILD}.${ARCH}.rpm
the OpenVSwitch RPM package depends on the "openvswitch-selinux-extra-policy" RPM package, so let's download it as well:
SELINUX_POLICY_PACKAGE=openvswitch-selinux-extra-policy
SELINUX_POLICY_VERSION=1.0
SELINUX_POLICY_BUILD=31.el9s
wget ${URL}/${SELINUX_POLICY_PACKAGE}/${SELINUX_POLICY_VERSION}/${SELINUX_POLICY_BUILD}/noarch/${SELINUX_POLICY_PACKAGE}-${SELINUX_POLICY_VERSION}-${SELINUX_POLICY_BUILD}.noarch.rpm
we can now install the software as follows:
dnf install -y ${OVSPACKAGE}-${OVSVERSION}-${OVSBUILD}.${ARCH}.rpm ${SELINUX_POLICY_PACKAGE}-${SELINUX_POLICY_VERSION}-${SELINUX_POLICY_BUILD}.noarch.rpm NetworkManager-ovs net-tools
start OpenVSwitch and enable it at boot:
systemctl enable --now openvswitch
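Before moving on, you can optionally verify that the switch daemons answer and check the installed version - at this stage "ovs-vsctl show" just prints an empty configuration:
ovs-vsctl show
ovs-vsctl --version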
we also need to restart NetworkManager, so that it loads the OpenVSwitch (OVS) module we just installed:
systemctl restart NetworkManager
some of the packages we are about to install are provided by the EPEL repo - enable it as follows:
dnf -y install oracle-epel-release-el9
now let's install Podman, along with the bridge-utils and jq RPM packages:
dnf install -y podman bridge-utils jq
since Podman delays the creation of the bridges used by its networks until at least one container is started, we need bridge-utils to manually create that bridge in advance, so that we can link it to the OpenVSwitch bridge holding the GENEVE port that connects to the other host.
Configure The Private Networks
Podman's Net1 Private Network
On both application servers, we create a Podman private network called "net1": the subnet must be the same on both hosts (192.168.120.0/24), but we must set different IP ranges to assign to the containers, so as to avoid IP collisions. In addition, encapsulated packets must fit the underlay network's MTU - in this example we are not using Jumbo frames, so we lower the MTU of the Podman network to 1450 to leave room for the GENEVE overhead.
Connect to the "as-ca-ut1a001" host and, as the root user, type:
podman network create --internal \
--subnet 192.168.120.0/24 \
--ip-range=192.168.120.1-192.168.120.126 \
--opt mtu=1450 \
net1
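The statement above creates the "net1" private network. If you want to double-check what Podman recorded (subnet, address range and MTU), you can inspect it - mind that the exact field names may vary slightly between Podman versions:
podman network inspect net1 | jq '.[0] | {subnets, options}'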
now let's spawn an Alpine Linux container on the "net1" network:
podman run -ti --rm --net=net1 alpine sh
let's check the container's network configuration:
ip -4 -o addr show dev eth0
the outcome is as follows:
2: eth0 inet 192.168.120.2/24 brd 192.168.120.255 scope global eth0\ valid_lft forever preferred_lft forever
leave the container open and, from another terminal, connect to the "as-ca-ut1a002" host.
Once logged on, switch to the root user and type:
podman network create --internal \
--subnet 192.168.120.0/24 \
--ip-range=192.168.120.129-192.168.120.254 \
--opt mtu=1450 \
net1
then spawn an Alpine Linux container also on this host:
podman run -ti --rm --net=net1 alpine sh
and also here, let's check the container's network configuration:
ip -4 -o addr show dev eth0
the outcome is as follows:
2: eth0 inet 192.168.120.129/24 brd 192.168.120.255 scope global eth0\ valid_lft forever preferred_lft forever
as expected, both containers have an IP address from the 192.168.120.0/24 network.
Now, from the current container (the one with IP 192.168.120.129) running on the "as-ca-ut1a002" application server, let's try to ping the container running on the "as-ca-ut1a001" application server:
ping -c 1 192.168.120.2
the outcome is:
PING 192.168.120.2 (192.168.120.2): 56 data bytes
--- 192.168.120.2 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
this should not surprise us: although we assigned the same subnet to the "net1" Podman network on both application servers, they are actually two distinct networks, running on two distinct hosts.
Exit both containers:
exit
We are about to interconnect the two Podman "net1" private networks using a GENEVE tunnel.
First, on both the application servers, create the ethernet bridge used by the Podman's "net1" private network as follows:
PODMAN_BRIDGE=$(podman network inspect net1 | jq -r ".[]|.network_interface")
brctl addbr ${PODMAN_BRIDGE}
ip link set ${PODMAN_BRIDGE} up
VEths Used to Connect the Podman Bridge to the OVS Bridge
We are going to create a GENEVE tunnel using OpenVSwitch: this of course requires creating an OVS bridge and interconnecting it with the bridge we just created backing the Podman "net1" private network.
To interconnect these two bridges we need a VEth interface pair - let's create it on both application servers as follows:
nmcli connection add type veth con-name veth0 ifname veth0 veth.peer veth1
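As a side note, the same pair could be created (non-persistently) with plain iproute2 - there is no need to run it now, since we use nmcli above so that NetworkManager keeps track of the interfaces; we will meet this iproute2 form again later, when working around a NetworkManager bug:
ip link add veth0 type veth peer name veth1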
on both application servers, we can now link the "veth0" interface to the Podman bridge:
brctl addif ${PODMAN_BRIDGE} veth0
OpenVSwitch Bridge
It is now necessary to create the "ovs_net1" OpenVSwitch bridge - on both application servers, type the following statement:
ovs-vsctl add-br ovs_net1
and link the "veth1" interface to it:
ovs-vsctl add-port ovs_net1 veth1
last but not least, bring both VEth interfaces up:
ip link set veth1 up
ip link set veth0 up
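At this point you can optionally review the resulting layout on both application servers - these commands only display the current state: the OVS bridge should list the "veth1" port, and the Podman bridge should list "veth0":
ovs-vsctl show
brctl show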
GENEVE Tunnels
We are finally ready to set up the GENEVE tunnels; before that, though, on both application servers we need to create the related firewalld service and a rule permitting GENEVE traffic between the hosts on the services subnet (192.168.0.0/24).
Create the "/etc/firewalld/services/geneve.xml" file with the following contents:
<?xml version="1.0" encoding="utf-8"?>
<service>
<short>GENEVE</short>
<description>Enable GENEVE incoming traffic</description>
<port protocol="udp" port="6081"/>
</service>
then create the firewall rule that accepts GENEVE traffic only from the "192.168.0.0/24" network:
firewall-cmd --permanent --zone=services1 --add-rich-rule='rule family="ipv4" source address="192.168.0.0/24" service name="geneve" accept'
finally reload firewalld to apply the configuration:
firewall-cmd --reload
let's make sure that it actually loaded it:
firewall-cmd --list-all --zone=services1
the outcome must be as follows:
services1 (active)
target: default
icmp-block-inversion: no
interfaces: enp0s6.100
sources:
services:
ports:
protocols:
forward: no
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
rule family="ipv4" source address="192.168.0.0/24" service name="geneve" accept
we can now create the GENEVE tunnels - let's start from the "as-ca-ut1a001" host: connect to it and, as the root user, type:
ovs-vsctl add-port ovs_net1 to_as-ca-ut1a002 -- set interface to_as-ca-ut1a002 \
type=geneve options:remote_ip=192.168.0.11
then connect to the "as-ca-ut1a002" host and, as the root user, type:
ovs-vsctl add-port ovs_net1 to_as-ca-ut1a001 -- set interface to_as-ca-ut1a001 \
type=geneve options:remote_ip=192.168.0.10
since the GENEVE endpoint uses port UDP/6081, we can easily make sure the above statements actually took effect - just type:
ss -lnup |grep 6081
the outcome must be as follows:
UNCONN 0 0 0.0.0.0:6081 0.0.0.0:*
UNCONN 0 0 [::]:6081 [::]:*
we are almost ready to launch a container again and perform a connectivity check, but first we must bring the "ovs_net1" bridge up:
ip link set ovs_net1 up
then, on both the application servers, launch an instance of the Alpine container:
podman run -ti --net=net1 alpine sh
now, on both containers, print the IP address:
ip -4 -o addr show dev eth0
on my system, the one running on the "as-ca-ut1a001" host shows:
2: eth0 inet 192.168.120.3/24 brd 192.168.120.255 scope global eth0\ valid_lft forever preferred_lft forever
whereas the one running on the "as-ca-ut1a002" host shows:
2: eth0 inet 192.168.120.130/24 brd 192.168.120.255 scope global eth0\ valid_lft forever preferred_lft forever
let's try to ping the container running on the "as-ca-ut1a002" host from the one running on the "as-ca-ut1a001" host - in the container running on "as-ca-ut1a001", type:
ping -c 1 192.168.120.130
this time the outcome must be as follows:
PING 192.168.120.130 (192.168.120.130): 56 data bytes
64 bytes from 192.168.120.130: seq=0 ttl=42 time=1.982 ms
--- 192.168.120.130 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 1.982/1.982/1.982 ms
this time it works, thanks to the GENEVE tunnel interconnecting the two Podman private networks.
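If you are curious to see the encapsulation at work on the underlay network, you can capture the UDP/6081 traffic on the services VLAN interface while the ping is running - on my lab VMs the interface is "enp0s6.100" (the one shown in the firewalld zone above), so adjust the name to match your system, and install tcpdump if it is not already there:
dnf install -y tcpdump
tcpdump -ni enp0s6.100 udp port 6081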
Exit the container:
exit
Persisting The Networking Setup
Like every nice thing, this setup won't last long: at the first reboot, most of it will be gone.
We can however make it persistent by creating a script and a Systemd unit to trigger it at boot.
Create the directory tree for storing the script:
mkdir -m 755 /opt/grimoire /opt/grimoire/bin
then create the "/opt/grimoire/bin/podman-ovs.sh" script with the following contents:
#!/bin/bash
PODMAN_BRIDGE=$(podman network inspect net1 | jq -r ".[]|.network_interface")
brctl addbr ${PODMAN_BRIDGE}
sleep 2
ip link set ${PODMAN_BRIDGE} up
brctl addif ${PODMAN_BRIDGE} veth0
ip link set veth1 up
ip link set veth0 up
ip link set ovs_net1 up
If your system is affected by "Bug 1915284 - veth device profiles activation is not reboot persistent", which prevents the veths from being recreated at system boot, you need to add the following statement at line 6, right before the "brctl addif ${PODMAN_BRIDGE} veth0" statement.
ip link add veth0 type veth peer name veth1
and set it executable:
chmod 755 /opt/grimoire/bin/podman-ovs.sh
create the "/etc/systemd/system/podman-ovs.service" Systemd unit we use to trigger the script at boot time:
[Unit]
Description=Link Podman To OpenVSwitch
After=network.target network.service openvswitch.service podman.service
Requires=ovsdb-server.service
Requires=ovs-vswitchd.service
Requires=openvswitch.service
Requires=podman.service
[Service]
Type=idle
ExecStart=/opt/grimoire/bin/podman-ovs.sh
[Install]
WantedBy=multi-user.target
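Optionally, you can have Systemd check the unit file for obvious mistakes before enabling it - it may print warnings about units that are not installed on your system; what matters is that there are no syntax errors:
systemd-analyze verify /etc/systemd/system/podman-ovs.service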
reload Systemd to make it aware of the new "podman-ovs.service" unit:
systemctl daemon-reload
and enable the "podman-ovs.service" unit to start at boot:
systemctl enable podman-ovs.service
Reboot the VM:
shutdown -r now
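Once the VM is back up, reconnect to it (vagrant ssh, then sudo su -) and optionally check that the unit ran and that the bridges and the GENEVE port are in place - the OVS configuration is persisted in its own database, while the Podman bridge is re-created by the script:
systemctl status podman-ovs.service --no-pager
ovs-vsctl show
brctl show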
Test: Download A File From A Web Service
Now that everything is up and running we can mock a web service and test.
On the "as-ca-ut1a001" application server, launch a container with the official Python container image:
podman run -ti --net=net1 python bash
Since neither the "ip" nor the "ifconfig" command line tools are available in this container image, we use Python to detect the IP configuration - launch Python as follows:
python3
then run the following code snippet to get the IP address:
import socket
hostname = socket.gethostname()
print(socket.gethostbyname(hostname))
on my system it detects "192.168.120.2".
Exit back to the shell:
exit()
then change to the "/etc" directory:
cd /etc
and launch the Python HTTP server:
python3 -m http.server 8080
on "as-ca-ut1a002" application server, start again an instance of the Alpine container:
podman run -ti --net=net1 alpine sh
then download the "motd" file from the python container running on the "as-ca-ut1a001" host:
wget -q -O - http://192.168.120.2:8080/motd
the contents are printed right to the standard output.
While we are at it, ... just to demonstrate once again the importance of security, type the following statement:
wget -q -O - http://192.168.120.2:8080/passwd
again, the contents are printed right to the standard output.
But even worse:
wget -q -O - http://192.168.120.2:8080/shadow
so we are getting the contents of the "/etc/shadow" file: this is a lab and we are inside a container, ... so no worries, but I wanted to show how easy it is to weaken security if you don't take enough care of everything.
Footnotes
Here ends this post dedicated to GENEVE and Podman: I hope that seeing everything in action helped you get a good understanding of how overlay networks work. Having a good understanding of overlay networking is a really valuable skill, since overlay networks are not only bricks but actual pillars of modern and resilient infrastructures. I hope the content shown in this post is enough to let you continue exploring this amazing topic by yourself.
I hope you enjoyed it, and if you liked it please share this post on LinkedIn: if I see it arouses enough interest, we can stay on this topic and spend some time on a post explaining how to set OpenFlow rules.