Regular Expressions are certainly one of the pillars every Linux professional must master: they are used whenever you need to look up or substitute a pattern that matches some criteria. Tools such as grep and sed lose almost all of their power if the person using them does not have a good understanding of Regular Expressions. This is really a huge topic: there is more than one book fully focused on regular expressions. This post is only a quick guide: its aim is to let the reader get the gist of what Regular Expressions are by explaining everything you need to know to face the common use cases that may arise during daily work.
The purpose of Regular Expressions
Regular expressions provide a way to express a pattern to match by using a standard syntax made of a sequence of characters. This post illustrates their usage through examples based on grep.
When dealing with Perl-Compatible Regular Expressions (PCRE), you can get some handy documentation by installing the pcre-devel RPM package: this lets you read the PCRE syntax man page as follows:
man pcresyntax
Meta-Characters
Meta-characters are reserved symbols used to assist in matching. The following list summarizes the most commonly used ones:
^ (caret) means the beginning of the line
grep "^Nov 10" messages.1
$ (dollar) means the end of the line
grep "terminating$" messages
. (dot) matches any single character: the following pattern searches for lines made of exactly three characters in the "foo.txt" file:
grep "^...$" foo.txt
The following example prints the contents of a source file, stripping the lines that begin with a comment marker:
grep -v "^#\|^'\|^//" mycode.py
The expression matches lines that begin with #, ' or //. Of these, only # starts a comment in Python; ' and // are comment markers in other languages, such as Visual Basic and C. The regular expression matches only the commented-out lines, and the "-v" option of grep inverts the match so that every other line is printed.
Sometimes you need a pattern that includes dots that must not be evaluated as wildcards: simply escape the dot character with a backslash, as in the following example:
grep "192\.168\.50\.31" /var/log/secure
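The effect of the anchors and of the escaped dot can be checked with a small sketch like the one below. The sample lines are made up for illustration only, and the variable names are arbitrary:

```shell
# Sample input, invented for this illustration
sample=$(printf '%s\n' 'foo' 'fo' 'food' 'bar' '192.168.50.31' '192x168y50z31')

# "^...$" matches only the lines made of exactly three characters
three_chars=$(printf '%s\n' "$sample" | grep '^...$')

# Unescaped dots match any character, so both address-like lines are counted
loose=$(printf '%s\n' "$sample" | grep -c '192.168.50.31')

# Escaped dots match only a literal dot, so a single line is counted
strict=$(printf '%s\n' "$sample" | grep -c '192\.168\.50\.31')

printf 'three-character lines:\n%s\n' "$three_chars"
printf 'unescaped dots: %s line(s), escaped dots: %s line(s)\n' "$loose" "$strict"
```

The unescaped pattern also counts the line where the dots are replaced by letters, which is exactly the kind of false positive that escaping avoids.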
Class of characters
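It is possible to express a group (class) of characters by enclosing the characters between square brackets []. For example, [aeiou] matches any single vowel and [0-9] matches any single digit.
POSIX also defines some built-in named character classes, which must themselves be placed inside the square brackets. The most used are [[:digit:]], [[:alpha:]], [[:alnum:]], [[:space:]], [[:upper:]] and [[:lower:]].
grep does not evaluate extended regular expressions by default: their support must be explicitly enabled by supplying the -E command switch when needed (the lowercase -e switch only introduces a pattern). With -E, quantifiers such as + are written without the backslash. For example:
grep -E "ntpd\[[[:digit:]]+\]" /var/log/messages
The above command matches every occurrence of "ntpd" followed by one or more digits enclosed in square brackets, such as "ntpd[1234]".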
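As a rough sketch of how a named character class behaves in the basic and in the extended syntax, the commands below run against a few syslog-style lines. These lines are invented for the example, and the -o option (print only the matched text) is assumed to be available, as it is in GNU grep:

```shell
# Made-up syslog-style lines, used only for this illustration
log=$(printf '%s\n' \
  'Nov 10 10:00:01 host ntpd[1234]: synchronized' \
  'Nov 10 10:00:02 host ntpd[]: no pid here' \
  'Nov 10 10:00:03 host sshd[987]: session opened')

# Basic syntax (grep default): the + quantifier needs a backslash (GNU grep)
bre=$(printf '%s\n' "$log" | grep -c 'ntpd\[[[:digit:]]\+\]')

# Extended syntax (-E): + is a quantifier without the backslash
ere=$(printf '%s\n' "$log" | grep -E -c 'ntpd\[[[:digit:]]+\]')

# -o prints only the matched text instead of the whole line
matched=$(printf '%s\n' "$log" | grep -E -o 'ntpd\[[[:digit:]]+\]')

printf 'basic: %s line(s), extended: %s line(s), matched text: %s\n' "$bre" "$ere" "$matched"
```

Both forms select the same single line: the one with empty brackets is skipped because at least one digit is required.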
Negating a class of characters
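When the ^ (caret) meta-character is the first character inside the square brackets, it negates the class of characters to match. For example
grep -i "^[^aeiou]" /usr/share/dict/linux.words
matches only the words that do not begin with a vowel: the first caret anchors the match to the beginning of the line, the second one negates the class.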
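The difference between "does not begin with a vowel" and "contains no vowel at all" is easy to miss, so here is a minimal sketch on a made-up word list (the words and variable names are only illustrative):

```shell
# Invented word list for this illustration
words=$(printf '%s\n' 'apple' 'Orange' 'banana' 'rhythm' 'kiwi')

# The first character of the line is NOT a vowel (-i ignores case)
no_leading_vowel=$(printf '%s\n' "$words" | grep -i '^[^aeiou]')

# The whole line is made of non-vowels only, so the word has no vowel at all
no_vowel_at_all=$(printf '%s\n' "$words" | grep -i '^[^aeiou]*$')

printf 'not starting with a vowel:\n%s\n' "$no_leading_vowel"
printf 'without any vowel:\n%s\n' "$no_vowel_at_all"
```

Anchoring both ends and repeating the negated class is what turns the first test into the second.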
Quantification
The following symbols quantify the number of times the preceding character, meta-character or class of characters must be repeated to have a match. In grep's basic syntax, ? and + must be escaped with a backslash:
* the previous character or class matches any number of times, including zero
? the previous character or class matches at most one time - this means that it can also match zero times
grep "hi \?hello" input
+ the previous character or class matches at least one time
grep "hi \+hello" input
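A quick way to compare the quantifiers is to count how many lines of the same input each one selects. The three sample lines below are invented for the purpose, and the escaped forms \? and \+ assume GNU grep:

```shell
# Zero, one and two spaces between the two words
input=$(printf '%s\n' 'hihello' 'hi hello' 'hi  hello')

# \? : zero or one space
optional=$(printf '%s\n' "$input" | grep -c 'hi \?hello')

# \+ : one or more spaces
one_or_more=$(printf '%s\n' "$input" | grep -c 'hi \+hello')

# * : zero or more spaces (never needs a backslash)
zero_or_more=$(printf '%s\n' "$input" | grep -c 'hi *hello')

printf '?: %s line(s), +: %s line(s), *: %s line(s)\n' "$optional" "$one_or_more" "$zero_or_more"
```

Only the * form selects all three lines: ? rejects the line with two spaces and + rejects the line with none.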
Unlike GNU and POSIX regular expressions, which implement only the greedy behavior, Perl regular expressions implement three different flavors of quantification: greedy, reluctant and possessive. Let's test them with the string "http://grimoire.carcano.ch/webfolder/foofile.html" - issue:
URL="http://grimoire.carcano.ch/webfolder/foofile.html"
Greedy: the default - it enlarges the match as much as possible - for example:
echo "$URL" | perl -pe 's|(http://.*/).*|$1|'
the output is
http://grimoire.carcano.ch/webfolder/
Reluctant: sometimes also called lazy - it keeps the match as small as possible - for example:
echo "$URL" | perl -pe 's|(http://.*?/).*|$1|'
the output is
http://grimoire.carcano.ch/
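Possessive: it performs a greedy match but never gives back what it has already matched - for example:
echo "$URL" | perl -pe 's|(http://.*+/).*|$1|'
the output is:
http://grimoire.carcano.ch/webfolder/foofile.html
that is the whole string: .*+ consumes everything up to the end of the line and does not backtrack, so the trailing / can never match, the substitution does not take place and the input is printed unchanged.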
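Note that the reluctant and possessive quantifiers are not available in POSIX regular expressions - that is why we used Perl in the examples. In addition, take into account that the reluctant quantifier can "cost" much more than the greedy one … so use it only when you really need it.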
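The quantifiers and their flavour-specific syntax can be summarized as follows: the greedy quantifiers are written *, + and ?; the reluctant ones append a question mark (*?, +? and ??); the possessive ones append a plus sign (*+, ++ and ?+).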
It is also possible to state the number of times the match should happen (in grep's basic syntax the braces must be escaped with a backslash):
{n} the previous character or class matches exactly n times
grep "^[0-9]\{5\}$" number
{n,} the previous character or class matches n or more times
grep "[0-9]\{5,\}" number
{n,m} the previous character or class matches at least n times, but not more than m times
grep "^[0-9]\{1,5\}$" number
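The interval quantifiers can be tried on a handful of made-up values. The sketch below also shows, as a side note, what happens when the anchors are left out:

```shell
# Invented sample values for this illustration
numbers=$(printf '%s\n' '1234' '12345' '123456' 'abc')

# Exactly five digits on the line
exactly_five=$(printf '%s\n' "$numbers" | grep '^[0-9]\{5\}$')

# Five or more digits on the line
five_or_more=$(printf '%s\n' "$numbers" | grep '^[0-9]\{5,\}$')

# Between one and five digits on the line
one_to_five=$(printf '%s\n' "$numbers" | grep '^[0-9]\{1,5\}$')

# Without anchors, any line CONTAINING five digits in a row is counted
unanchored=$(printf '%s\n' "$numbers" | grep -c '[0-9]\{5\}')

printf 'exactly 5:\n%s\n5 or more:\n%s\n1 to 5:\n%s\n' "$exactly_five" "$five_or_more" "$one_to_five"
printf 'unanchored {5}: %s line(s)\n' "$unanchored"
```

The unanchored count includes the six-digit value, because five consecutive digits can be found inside it.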
Examples
The following example uses a class of only digits:
grep "[0-9]\+ times" /var/log/messages
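it matches whenever at least one digit is followed by the word " times" - note how we had to escape the plus quantification meta-character
This example looks for valid IPv4 addresses. Each octet is wrapped in its own group, so that the \. that follows applies to whichever alternative matched:
grep -E '\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b' input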
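When the goal is to validate a single value rather than to search a file, one possible approach is to wrap the expression in a small shell function. In the sketch below, is_ipv4 and octet are hypothetical names chosen for this example, and the -x option is used so that the whole value has to match:

```shell
# One octet: 250-255, 200-249, or 0-199 (leading zeros are accepted)
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Hypothetical helper: succeeds only if the whole argument is an IPv4 address
is_ipv4() {
  # -E extended syntax, -x match the whole line, -q no output (exit status only)
  printf '%s\n' "$1" | grep -E -q -x "($octet\.){3}$octet"
}

for candidate in 192.168.50.31 256.1.1.1 10.0.0; do
  if is_ipv4 "$candidate"; then
    echo "$candidate: valid"
  else
    echo "$candidate: invalid"
  fi
done
```

Storing the octet expression in a variable keeps the pattern readable and makes sure that the same sub-expression is used four times.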
This last example uses the OR operator to match some variations of “Object Oriented”:
grep "OO\|\([oO]bject\( \|-\)[oO]riented\)"
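The alternation can be tried on a few made-up lines. The sketch below also writes the same expression in the extended syntax, where |, ( and ) need no backslash; the escaped \| form assumes GNU grep:

```shell
# Invented sample lines for this illustration
text=$(printf '%s\n' 'OO design' 'Object Oriented' 'object-oriented' 'objectoriented' 'procedural')

# Basic syntax: alternation and grouping are written \| and \( \)
basic=$(printf '%s\n' "$text" | grep 'OO\|\([oO]bject\( \|-\)[oO]riented\)')

# Extended syntax (-E): the same expression without the backslashes
extended=$(printf '%s\n' "$text" | grep -E 'OO|[oO]bject( |-)[oO]riented')

printf 'basic syntax:\n%s\n' "$basic"
printf 'extended syntax:\n%s\n' "$extended"
```

Both commands select the same three lines; "objectoriented" is left out because the expression requires either a space or a hyphen between the two words.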
Footnotes
Here ends the quick guide: although I use regular expressions, I cannot always remember everything, so I wrote it for my own needs. As it grew, I thought it had become mature enough that somebody else might benefit if I published it. So here it is: I hope you enjoyed it.