5.0 Regular Expressions
1. Regular expressions: Think of these as patterns built from special
characters that can be used to search for text strings in files or in
editing sessions. They contrast with shell file name expansion, which
matches patterns against file names only, not against the text inside a file.
Special Characters

Regular Expression   Shell File Name Expansion   Meaning
------------------   -------------------------   ------------------------------
.                    ?                           Any character in a single
                                                 character position
*                                                0 or more instances of the
                                                 preceding character
.*                   *                           0 or more instances of any
                                                 character
[ ]                  [ ]                         Selected characters or ranges
                                                 in a single character position
[^ ]                 [! ]                        Any character except the
                                                 selected characters or ranges
                                                 in a single character position
^x                                               An anchor to the beginning of
                                                 the line. x must be the first
                                                 character on the line.
x$                                               An anchor to the end of the
                                                 line. x must be the last
                                                 character on the line.
\                    \                           Makes the next character
                                                 ordinary (not special).
\( \)  \1                                        Capture a pattern and store or
                                                 retrieve from numbered buffer 1
                                                 (up to 9 numbered buffers)
\{m\}                                            Exactly m instances of the
                                                 previous character
\{m,\}                                           At least m instances of the
                                                 previous character
\{,n\}                                           At most n instances of the
                                                 previous character (a GNU
                                                 extension, not defined by POSIX)
\{m,n\}                                          Between m and n instances
                                                 of the previous character
                                                 (0 <= m <= n <= RE_DUP_MAX,
                                                 which POSIX requires to be
                                                 at least 255)

Shell file name expansion has no anchors: a shell pattern must always
match the entire file name.
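
The entries in the table can be tried outside an editor. The following is
a minimal sketch, assuming a POSIX shell and a grep that accepts basic
regular expressions. The sample words and the helper names match and glob
are made up for illustration; they are not part of any standard tool.

```shell
#!/bin/sh
# Hypothetical sample text: one word per line.
sample=$(printf '%s\n' cat caat ct dog a.b axb abab xyzzy)

# match REGEX: print, on one line, the sample lines that match REGEX.
# LC_ALL=C keeps ranges such as [a-d] independent of the current locale.
match() {
    printf '%s\n' "$sample" | LC_ALL=C grep -e "$1" | paste -s -d ' ' -
}

# glob NAME PATTERN: report whether NAME matches a shell file name pattern.
# A shell pattern must cover the whole name, so it needs no ^ or $ anchors.
glob() {
    case $1 in
        $2) echo yes ;;
        *)  echo no ;;
    esac
}

any_char=$(match '^c.t$')        # .      one character between c and t
star=$(match '^ca*t$')           # *      zero or more a's between c and t
negated=$(match '^[^a-d]')       # [^ ]   first character is not a, b, c or d
escaped=$(match 'a\.b')          # \      the dot is ordinary here
backref=$(match '^\(ab\)\1$')    # \( \)  \1 repeats the captured "ab"
interval=$(match '^ca\{2\}t$')   # \{m\}  exactly two a's

echo "any_char: $any_char"
echo "star:     $star"
echo "negated:  $negated"
echo "escaped:  $escaped"
echo "backref:  $backref"
echo "interval: $interval"

# The same ideas as shell patterns: ? and * replace . and .*
echo "cat  vs c?t: $(glob cat 'c?t')"
echo "caat vs c*t: $(glob caat 'c*t')"
echo "cat  vs c.t: $(glob cat 'c.t')"
echo "scat vs c?t: $(glob scat 'c?t')"
```

The last two calls show the difference in the other direction: in a shell
pattern the dot is an ordinary character, and the pattern must account for
every character of the name.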
2. Refer to Appendix A on Regular Expressions in the Sobell book.
Create some text in vi and try the commands that take regular
expressions, e.g. use /, ? or :s to exercise the examples given.
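
Because vi is interactive, the same substitutions are easier to repeat
from a script with sed, whose s command accepts the same basic regular
expressions as :s for the constructs used here. This is a minimal sketch,
assuming a POSIX sed; the sample strings and variable names are made up
for illustration.

```shell
#!/bin/sh
# Swap two words: \( \) captures each word, \1 and \2 retrieve them.
# In vi the same edit on the current line would be
#   :s/\([a-z]*\) \([a-z]*\)/\2 \1/
swapped=$(printf '%s\n' 'hello world' |
    LC_ALL=C sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

# Strip leading spaces: ^ anchors the match, " *" is zero or more spaces.
trimmed=$(printf '%s\n' '   indented' | sed 's/^ *//')

# Replace every run of three or four digits: \{3,4\} is an interval,
# and the g flag repeats the substitution along the line.
masked=$(printf '%s\n' 'call 555 1234' |
    LC_ALL=C sed 's/[0-9]\{3,4\}/N/g')

echo "$swapped"
echo "$trimmed"
echo "$masked"
```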
3. POSIX Bracket Expressions (sometimes known as Character Classes)
A POSIX character class is a special metasequence for use within a POSIX
bracket expression [ ].
An example is [:lower:], which represents the class of lower case letters
(relative to a locale) and is comparable to the character range a-z.
However, these colon-delimited expressions are only valid inside the
brackets, so [[:lower:]] is the form that corresponds to the lower case
letter range [a-z].
The POSIX character classes that are usually supported (locale dependent)
are listed below:
[:alnum:]   alphabetic characters and numeric characters
[:alpha:]   alphabetic characters
[:blank:]   space and tab
[:cntrl:]   control characters
[:digit:]   digits
[:graph:]   visible characters (printable, excluding the space character)
[:lower:]   lower case alphabetics
[:upper:]   upper case alphabetics
[:print:]   like [:graph:] but includes the space character as well
[:punct:]   punctuation characters
[:space:]   all whitespace characters ([:blank:], newline, carriage
            return, etc.)
[:xdigit:]  digits allowed in a hexadecimal number: [0-9a-fA-F]
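
The classes can be exercised with grep. The following is a minimal
sketch, assuming a POSIX shell and grep, run in the C locale so that the
results do not depend on the local language settings. The sample lines
and the helper name match are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample text: one entry per line.
sample=$(printf '%s\n' abc ABC 123 'a1 b2' 'x,y' beef)

# match REGEX: print, on one line, the sample lines that match REGEX.
match() {
    printf '%s\n' "$sample" | LC_ALL=C grep -e "$1" | paste -s -d ' ' -
}

lower=$(match '^[[:lower:]][[:lower:]]*$')     # only lower case letters
digits=$(match '^[[:digit:]][[:digit:]]*$')    # only digits
blank=$(match '[[:blank:]]')                   # contains a space or tab
punct=$(match '[[:punct:]]')                   # contains punctuation
hex=$(match '^[[:xdigit:]][[:xdigit:]]*$')     # only hexadecimal digits
# Several classes can share one bracket expression.
mixed=$(match '^[[:upper:][:digit:]]*$')       # upper case letters or digits

echo "lower:  $lower"
echo "digits: $digits"
echo "blank:  $blank"
echo "punct:  $punct"
echo "hex:    $hex"
echo "mixed:  $mixed"
```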
4. POSIX Bracket Expression Character Equivalents
Some locales define character equivalents (POSIX calls them equivalence
classes) to indicate that certain characters should be considered
identical for sorting purposes (e.g. a and a with an accent mark above
it). An equivalent is written with = instead of :
For example, [[=a=]] stands for all the kinds of 'a' in the locale's
character equivalents and matches any one of them in a single character
position.
In the absence of accented characters, [[=a=]] is equivalent to [a]
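
The fallback case can be checked in the C locale, which defines no
accented letters. This is a minimal sketch, assuming a grep whose regular
expression library accepts the [= =] notation (GNU grep does; some other
implementations may reject it). The sample letters and variable names are
made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample text: three single-letter lines.
sample=$(printf '%s\n' a b A)

# In the C locale the equivalent of 'a' contains only 'a' itself ...
equiv=$(printf '%s\n' "$sample" | LC_ALL=C grep -e '[[=a=]]' |
    paste -s -d ' ' -)

# ... so it selects the same lines as the plain bracket expression [a].
plain=$(printf '%s\n' "$sample" | LC_ALL=C grep -e '[a]' |
    paste -s -d ' ' -)

echo "[[=a=]] matched: $equiv"
echo "[a]     matched: $plain"
```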
5. POSIX Bracket Expression Collating Sequences
A locale can have a collating sequence to describe how certain characters
or sets of characters should be treated for sorting purposes. A collating
element that maps a set of characters to a single logical character is
considered 'one character' for regular expression purposes. This means
that both the special character . and the regular expression [^123] would
match this single logical character.
A collating element can be included within a bracket expression using a
'.' instead of a : or =. For example, the notation
torti[[.span-ll.]]a matches the word tortilla, as does torti.a in a locale
that defines this element.
The Spanish collating element span-ll matches the 'll' in tortilla. In
traditional Spanish collation, 'll' is a single letter that sorts between
l and m.
A collating sequence lets you match those characters that are made up of
other character combinations. It also creates a situation where a bracket
expression can match more than one physical character.
Another example is the German collating element [.ezret.], which stands
for the German letter that looks like a script upper case B (ß) and is
placed between s and t in the German alphabet.
Questions? Robert Katz: rkatz@ned.highline.edu
Last Update July 23, 2002