Regular Expressions are certainly a pillar every Linux professional must master: they are used whenever you need to look up or substitute text that matches some criteria. Tools such as grep and sed lose almost all of their power if the person using them does not have a good understanding of Regular Expressions. This is a huge topic, with entire books devoted to it, so this post is only a quick guide: its aim is to give the reader the gist of what Regular Expressions are by explaining what you need to know to handle the common use cases that arise in daily work.

The purpose of Regular Expressions

Regular expressions provide a way to express a pattern to match by using a standard syntax made of a sequence of characters. This post illustrates their usage through examples based on grep.

When dealing with Perl-Compatible Regular Expressions (PCRE), you can get some handy documentation by installing the pcre-devel RPM package: this lets you read the PCRE syntax man page as follows:

man pcresyntax

Meta-Characters

Metacharacters are reserved symbols used to assist in matching: the following list summarizes the most commonly used meta-characters:

.
it is the wildcard character: it matches any single character
^

the Caret character means the beginning of the line

grep "^Nov 10" messages.1
$

the Dollar character means the end of the line

grep "terminating$" messages

the following pattern searches for a line of only three characters in "foo.txt" file:

grep "^...$" foo.txt
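To see the anchors and the wildcard at work without touching real log files, here is a minimal sketch that runs the same kind of patterns on a few made-up lines (the sample data and the variable names are assumptions used only for illustration):

```shell
# Made-up sample lines, one per argument.
sample=$(printf '%s\n' 'foo' 'food' 'ab' 'bar' 'Nov 10 boot' 'x Nov 10')

# ^...$ : exactly three characters between start and end of line.
three=$(printf '%s\n' "$sample" | grep '^...$')

# ^Nov 10 : "Nov 10" only when it appears at the start of the line.
starts=$(printf '%s\n' "$sample" | grep '^Nov 10')

# boot$ : "boot" only when it appears at the end of the line.
ends=$(printf '%s\n' "$sample" | grep 'boot$')

printf '%s\n' "$three" "$starts" "$ends"
```

Single quotes are used around the patterns so that the shell never tries to expand the $ character.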

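The following example prints the contents of a Python file, stripping the lines that start with a comment:

grep -v "^#\|^'\|^//" mycode.py

the regular expression matches the lines that begin with #, ' or //, and the "-v" option of grep inverts the match, so only the other lines are printed. Note that in Python only # starts a comment: a leading ' typically opens a single-quoted string (such as a docstring), whereas // is a comment marker in other languages such as C or JavaScript.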
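Since # is the character that introduces a comment in Python, a narrower variant can be sketched that drops only the lines whose first non-blank character is #. The sample lines below are made up and stand in for mycode.py:

```shell
# Made-up source lines standing in for mycode.py.
code=$(printf '%s\n' '# a full-line comment' 'x = 1' '    # an indented comment' 'print(x)  # trailing comment')

# Drop lines whose first non-blank character is #.
# Trailing comments are kept, because those lines do not start with one.
stripped=$(printf '%s\n' "$code" | grep -v '^[[:space:]]*#')

printf '%s\n' "$stripped"
```

This is a line filter, not a Python parser: a # inside a multi-line string would be treated as a comment too.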

Sometimes a pattern includes dots that must be taken literally rather than as wildcards: simply escape each dot with a backslash, as in the following example:

grep "192\.168\.50\.31"  /var/log/secure
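The difference between the escaped and the unescaped dot is easy to check on a couple of made-up log lines (the sample data is an assumption for illustration; -c makes grep print the number of matching lines):

```shell
# Hypothetical log lines: only the first one contains the literal address.
log=$(printf '%s\n' 'from 192.168.50.31 port 22' 'from 192x168y50z31 port 22')

# Unescaped: each . matches any character, so both lines match.
unescaped=$(printf '%s\n' "$log" | grep -c '192.168.50.31')

# Escaped: each \. matches only a literal dot, so only the first line matches.
escaped=$(printf '%s\n' "$log" | grep -c '192\.168\.50\.31')

echo "unescaped=$unescaped escaped=$escaped"
```

When the whole pattern is a fixed string, grep -F is an alternative that avoids escaping altogether.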

Class of characters

It is possible to express a group (class) of characters by enclosing the characters between square brackets [] using a specific syntax. For example:

[hc]
is a class of characters that contains only h and c characters
[0-9]
is a class of characters that contains all the digits (it goes from 0 to 9)
[a-zA-Z]
is a class of characters that contains all the ASCII letters, both lowercase and uppercase
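A short sketch of these three classes on made-up words. LC_ALL=C is set here as an assumption to keep the ranges predictable, since the meaning of a range such as a-z can depend on the locale:

```shell
# Made-up sample words.
words=$(printf '%s\n' 'hat' 'cat' 'bat' 'A1' 'b22')

# [hc] : either h or c as the first letter.
hc=$(printf '%s\n' "$words" | LC_ALL=C grep '^[hc]at$')

# [0-9] : count the lines that contain at least one digit.
with_digit=$(printf '%s\n' "$words" | LC_ALL=C grep -c '[0-9]')

# [a-zA-Z] : lines made of letters only.
letters_only=$(printf '%s\n' "$words" | LC_ALL=C grep '^[a-zA-Z]*$')

printf '%s\n' "$hc" "$with_digit" "$letters_only"
```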

POSIX also defines some built-in named character classes, which must themselves be placed inside a bracket expression (for example [[:digit:]]) - the most used are:

[:digit:]
all the digits from 0 to 9
[:alnum:]
any alphanumeric character - from 0 to 9 OR from A to Z or from a to z
[:alpha:]
any alpha character from A to Z or from a to z
[:lower:]
any lowercase alpha character from a to z
[:upper:]
any uppercase alpha character from A to Z
[:blank:]
space and TAB characters only
[:punct:]
punctuation characters - [][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]

grep does not evaluate extended regular expressions by default: their support must be explicitly enabled by supplying the -E command switch (the lowercase -e switch only introduces the pattern). With extended regular expressions, quantifiers such as + do not need to be escaped. For example:

grep -E "ntpd\[[[:digit:]]+\]" /var/log/messages

The above command matches every line that contains "ntpd[" followed by one or more digits and a closing "]", for example "ntpd[1234]".
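The same match can be written in both syntaxes. The sketch below assumes GNU grep (for the \+ quantifier in basic mode and for the -o option, which prints only the matched text) and uses a made-up log line:

```shell
# Made-up syslog-style line.
line='Nov 10 12:00:01 host ntpd[1234]: synchronized'

# Basic syntax (default): GNU grep needs \+ for "one or more".
# The final ] is outside any bracket expression, so it is a literal character.
bre=$(printf '%s\n' "$line" | grep -o 'ntpd\[[[:digit:]]\+]')

# Extended syntax (-E): + is a quantifier without the backslash.
ere=$(printf '%s\n' "$line" | grep -oE 'ntpd\[[[:digit:]]+]')

printf '%s\n' "$bre" "$ere"
```

Note the double brackets: [:digit:] is the class name, and the outer [ ] is the bracket expression that contains it.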

Negating a class of characters

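When the ^ (Caret) meta-character is the first character inside the square brackets, it negates the class of characters to match: for example

grep -i  "^[^aeiou]" /usr/share/dict/linux.words

matches only the words whose first character is not a vowel: the first ^ anchors the match to the beginning of the line, the second one negates the class.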
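The two roles of the caret are easy to mix up, so here is a minimal sketch on a made-up word list that contrasts "does not start with a vowel" with "contains no vowel at all":

```shell
# Made-up word list.
words=$(printf '%s\n' 'apple' 'Banana' 'rhythm' 'Echo' 'sky')

# First ^ anchors to the start of the line, [^aeiou] is one non-vowel character.
# -i makes the class case-insensitive, so "Echo" is excluded as well.
not_vowel_first=$(printf '%s\n' "$words" | grep -i '^[^aeiou]')

# Every character from start to end must be a non-vowel.
no_vowels=$(printf '%s\n' "$words" | grep -i '^[^aeiou]*$')

printf '%s\n' "$not_vowel_first" "$no_vowels"
```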

Quantification

The following symbols are used to quantify the number of times the preceding character or class of characters should be repeated to have a match (in basic regular expressions GNU grep requires ? and + to be escaped with a backslash):

?

the previous character or class matches at most one time - this means that it can also match zero times

grep "hi \?hello" input
*
the previous character or class matches zero or more times.
+

the previous character or class matches at least one time

grep "hi \+hello" input
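A quick way to compare the three quantifiers is to count how many made-up lines each pattern matches. The sketch assumes GNU grep, where \? and \+ are available in the default basic syntax:

```shell
# Made-up input: zero, one and three spaces between the two words.
input=$(printf '%s\n' 'hihello' 'hi hello' 'hi   hello')

# \? : zero or one space  -> hihello, hi hello
optional=$(printf '%s\n' "$input" | grep -c 'hi \?hello')

# *  : zero or more spaces -> all three lines
any=$(printf '%s\n' "$input" | grep -c 'hi *hello')

# \+ : one or more spaces -> hi hello, hi   hello
some=$(printf '%s\n' "$input" | grep -c 'hi \+hello')

echo "optional=$optional any=$any some=$some"
```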

Unlike GNU and POSIX regular expressions, which implement only the greedy behavior, Perl regular expressions implement three different flavors of quantification: let's test them with the string "http://grimoire.carcano.ch/webfolder/foofile.html" - issue:

URL="http://grimoire.carcano.ch/webfolder/foofile.html"
*
greedy

the default – it enlarges the match as much as possible - for example:

echo $URL|perl -pe 's|(http://.*/).*|\1|'

the output is

http://grimoire.carcano.ch/webfolder/
*?
reluctant

also called lazy - it keeps the match as small as possible - for example:

echo $URL|perl -pe 's|(http://.*?/).*|\1|'

the output is

http://grimoire.carcano.ch/
*+
possessive

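performs a greedy match, but never gives back what it has already matched (no backtracking) - for example:

echo $URL|perl -pe 's|(http://.*+/).*|\1|'

the output is:

http://grimoire.carcano.ch/webfolder/foofile.html

that is the whole string: ".*+" consumes everything up to the end of the line and refuses to release the last "/", so the pattern does not match at all, no substitution takes place and perl prints the input line unchanged.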

Neither basic nor extended POSIX/GNU regular expressions recognize the non-greedy quantifiers, which is why we used Perl in the examples. In addition, take into account that the reluctant quantifier can "cost" more than the greedy one, since the engine has to try the rest of the pattern after every character it adds to the match, so use it only when you really need it.

The following table summarizes quantifiers and the flavour-specific syntax:

Quantifier      Greedy (GNU/POSIX, Perl)   Lazy (Perl)   Possessive (Perl)
Zero or more    *                          *?            *+
One or more     +                          +?            ++
Zero or one     ?                          ??            ?+

it is also possible to state the number of times the match should happen (the braces must be escaped in basic regular expressions, as in the examples below):

{n}

the previous character or class matches exactly n times

grep "^[0-9]\{5\}$" number
{n,}

the previous character or class matches n or more times

grep "[0-9]\{5,\}" number
{,m}
the previous character or class matches at most m times (this form is a GNU extension)
{n,m}

the previous character or class matches at least n times, but not more than m times

grep "^[0-9]\{1,5\}$" number
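The interval forms can be checked against a few made-up values. The last command is the extended-syntax equivalent, where the braces are not escaped:

```shell
# Made-up input values.
numbers=$(printf '%s\n' '123' '12345' '1234567' 'abc')

# Exactly five digits on the line.
exactly5=$(printf '%s\n' "$numbers" | grep '^[0-9]\{5\}$')

# Five or more digits on the line.
five_plus=$(printf '%s\n' "$numbers" | grep '^[0-9]\{5,\}$')

# From one to five digits, written in extended syntax (-E).
one_to_five=$(printf '%s\n' "$numbers" | grep -E '^[0-9]{1,5}$')

printf '%s\n' "$exactly5" "$five_plus" "$one_to_five"
```

The anchors matter here: without ^ and $, a pattern such as [0-9]\{5\} would also match the first five digits of a longer number.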

Examples

The following example uses a class of only digits:

grep "[0-9]\+ times" /var/log/messages

it matches one or more digits followed by " times" - note how we had to escape the plus quantification meta-character, since grep uses basic regular expressions by default

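This example matches a valid IPv4 address (note that each of the first three octets is grouped together with its trailing dot, so that the dot applies to every alternative):

grep -E '\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b' input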
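For strict validation of a whole line, an anchored variant of an IPv4 pattern can be sketched as follows. The octet sub-pattern is stored in a variable only to keep the expression readable; the variable names and the candidate values are assumptions for illustration:

```shell
# One octet: 250-255, 200-249, or 0-199 (with optional leading digits).
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Three "octet + dot" groups followed by a final octet, anchored to the whole line.
# Inside double quotes, \\. becomes \. and \$ becomes a literal $ for grep.
ipv4="^($octet\\.){3}$octet\$"

# Made-up candidates: two valid addresses and three invalid ones.
candidates=$(printf '%s\n' '192.168.50.31' '256.1.1.1' '10.0.0' '8.8.8.8' '1.2.3.4.5')

valid=$(printf '%s\n' "$candidates" | grep -E "$ipv4")

printf '%s\n' "$valid"
```

Grouping each octet together with its dot is the important detail: if the dot sits only at the end of the last alternative, it is required only when that alternative is the one that matches.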

this last example uses the OR operator to match some variations of “Object Oriented”:

grep "OO\|\([oO]bject\( \|-\)[oO]riented\)"
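The alternation can be tried on a few made-up lines in both syntaxes. In this sketch the inner group "space or hyphen" is written as the equivalent bracket expression [ -], and \| in basic syntax is assumed to be available as it is in GNU grep:

```shell
# Made-up sample lines.
text=$(printf '%s\n' 'OO design' 'object-oriented code' 'Object Oriented' 'objectoriented' 'procedural')

# Basic syntax: the OR operator is written \|
bre=$(printf '%s\n' "$text" | grep -c 'OO\|[oO]bject[ -][oO]riented')

# Extended syntax (-E): the OR operator is a plain |
ere=$(printf '%s\n' "$text" | grep -cE 'OO|[oO]bject[ -][oO]riented')

echo "bre=$bre ere=$ere"
```

Both commands count the first three lines: "objectoriented" has no separator between the two words, so it is not matched.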

Footnotes

Here ends the quick guide: although I use regular expressions, I cannot always remember everything, so I wrote it for my own needs; as it grew, I thought that it had become quite mature and that somebody else might benefit if I published it. So here it is: I hope you enjoyed it.

