Introduction to UNIX Notes:50

5.0 Regular Expressions

1. Regular expressions: Think of these as patterns built from special
characters that can be used to search for text strings in files or in
editing sessions.  Contrast them with shell file name expansion, which
matches the names of files (by default in the current directory), not
the text inside a file.
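
A minimal sketch of this contrast, assuming a POSIX shell with mktemp and
grep available. The file names (notes.txt, nodes.txt, names.dat) and their
contents are made up for illustration.

```shell
# Work in a scratch directory so the example does not touch real files.
dir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$dir/notes.txt"
printf 'gamma\n'       > "$dir/nodes.txt"
: > "$dir/names.dat"

# Shell file name expansion: the shell matches the pattern against FILE NAMES.
# ? stands for any one character, so both notes.txt and nodes.txt match.
globbed=$(cd "$dir" && echo no?es.txt)

# Regular expression: grep matches the pattern against the TEXT in a file.
# The pattern is quoted so the shell hands it to grep unchanged.
# . stands for any one character, so b.ta matches the line "beta".
found=$(grep 'b.ta' "$dir/notes.txt")

echo "file names matched by the shell: $globbed"
echo "text matched by grep:            $found"

rm -rf "$dir"
```

Note that the "any one character" role is played by ? in the shell and by
. in the regular expression.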

                        Special Characters

Regular Expression    Shell File Name Expansion    Meaning
------------------    -------------------------    ------------------------------
.                     ?                            Any character in a single
                                                   character position

*                                                  0 or more instances of the
                                                   preceding character

.*                    *                            0 or more instances of any
                                                   character

[  ]                  [  ]                         Selected characters or ranges
                                                   in a single character position

[^  ]                 [!  ]                        Any characters except the
                                                   selected characters or ranges
                                                   in a single character position

^x                                                 ^ is an anchor to the beginning
                                                   of the line. x must be the
                                                   first character on the line.

x$                                                 $ is an anchor to the end of
                                                   the line. x must be the last
                                                   character on the line.

\                     \                            Makes the next character
                                                   ordinary (not special).

\(  \)  \1                                         Capture a pattern and store or
                                                   retrieve from numbered buffer 1
                                                   (up to 9 numbered buffers)

\{m\}                                              Exactly m instances of the
                                                   previous character

\{m,\}                                             At least m instances of the
                                                   previous character

\{,n\}                                             At most n instances of the
                                                   previous character (a GNU
                                                   extension; the POSIX spelling
                                                   is \{0,n\})

\{m,n\}                                            Between m and n instances
                                                   of the previous character
                                                   (0 <= m <= n <= RE_DUP_MAX,
                                                   which POSIX sets to at
                                                   least 255)
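
The entries in the table can be tried out with grep, which uses this basic
regular expression syntax by default. The following is a sketch only: the
word list and the helper name match are made up for illustration, and a
POSIX shell with grep and paste is assumed.

```shell
# Five sample lines to search.
words='ct
cat
cot
coat
coot'

# match PATTERN: print the lines of $words that match, joined by spaces.
# (match is a hypothetical helper defined just for this example.)
match() { printf '%s\n' "$words" | grep -- "$1" | paste -s -d ' ' -; }

dot=$(match '^c.t$')            # .       one character between c and t
star=$(match '^co*t$')          # *       zero or more o's
any=$(match '^c.*t$')           # .*      anything between c and t
set=$(match '^c[ao]t$')         # [ ]     a or o in one position
notset=$(match '^c[^a]t$')      # [^ ]    any one character except a
exact=$(match '^co\{2\}t$')     # \{m\}   exactly two o's
atleast=$(match '^co\{1,\}t$')  # \{m,\}  one or more o's
range=$(match '^co\{0,1\}t$')   # \{m,n\} zero or one o
backref=$(match '^c\(o\)\1t$')  # \( \) \1  an o followed by the same text

# \ makes the next character ordinary: only a literal dot matches here.
escaped=$(printf 'a.b\naxb\n' | grep 'a\.b')

echo "dot=$dot star=$star set=$set notset=$notset"
echo "exact=$exact atleast=$atleast range=$range backref=$backref"
echo "any=$any escaped=$escaped"
```

The ^ and $ anchors are used in every pattern so that the whole line has to
match; without them grep reports any line that contains a match anywhere.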

2. Refer to Appendix A, Regular Expressions, in the Sobell book.
   Create some text in vi and exercise the examples given there with the
   commands that take regular expressions, e.g. / ? or :s
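
   The vi commands are interactive, so as a rough stand-in the sketch below
   uses sed, whose s command takes the same pattern/replacement form as
   the :s command in vi. The sample line is made up for illustration.

```shell
line='Smith, John'

# In vi, with the cursor on this line, the equivalent command would be
#   :s/\(.*\), \(.*\)/\2 \1/
# Buffer 1 captures the text before the comma, buffer 2 the text after it,
# and the replacement writes them back in the opposite order.
swapped=$(printf '%s\n' "$line" | sed 's/\(.*\), \(.*\)/\2 \1/')

# ^ anchors the match to the start of the line, so this inserts a prefix.
# In vi:  :s/^/> /
quoted=$(printf '%s\n' "$line" | sed 's/^/> /')

echo "$swapped"
echo "$quoted"
```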

3. POSIX Bracket Expressions (sometimes known as Character Classes)

   A character class is a special metasequence for use within a POSIX
   bracket expression [    ]
   An example is [:lower:], which represents the class of lower case
   letters (relative to a locale) and is comparable to the character range
   a-z. However, these colon-delimited expressions are only valid inside the
   brackets, so [[:lower:]] is what matches the same letters as the [a-z]
   range.

   The list of POSIX character classes that is usually supported (locale
   dependent) is indicated below:
      [:alnum:]     alphabetic characters and numeric characters
      [:alpha:]     alphabetic characters 
      [:blank:]     space and tab
      [:cntrl:]     control characters
      [:digit:]     digits
      [:graph:]     visible characters (printable characters other than space)
      [:lower:]     lower case alphabetics
      [:upper:]     upper case alphabetics
      [:print:]     like [:graph:] but includes the space character as well
      [:punct:]     punctuation characters
      [:space:]     all whitespace characters ([:blank:], newline, carriage
                        return, etc.)
      [:xdigit:]    digits allowed in a hexadecimal number: [0-9a-fA-F]
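
   A short sketch of the classes in use with grep. The sample lines and the
   helper name pick are made up for illustration, and LC_ALL=C is set so
   that the results do not depend on the locale of the machine.

```shell
sample='abc
ABC
123
a1 b2
x_y'

# pick PATTERN: print the lines of $sample that match, joined by spaces.
# (pick is a hypothetical helper defined just for this example.)
pick() { printf '%s\n' "$sample" | LC_ALL=C grep -- "$1" | paste -s -d ' ' -; }

# The class name goes INSIDE a bracket expression: [[:lower:]], not [:lower:].
lower=$(pick '^[[:lower:]]\{1,\}$')            # lines of lower case letters only
digits=$(pick '^[[:digit:]]\{1,\}$')           # lines of digits only
blank=$(pick '[[:blank:]]')                    # lines containing a space or tab
punct=$(pick '[[:punct:]]')                    # lines containing punctuation

# Several classes (and ordinary characters) can share one pair of brackets.
updig=$(pick '^[[:upper:][:digit:]]\{1,\}$')   # upper case letters or digits

echo "lower=$lower digits=$digits updig=$updig"
echo "blank=$blank"
echo "punct=$punct"
```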
     
4. POSIX Bracket Expression Character Equivalents
   Some locales define character equivalents to indicate that certain
   characters should be considered identical for sorting purposes (e.g. a
   and a with an accent mark above it). An equivalent is written with =
   instead of : as the delimiter.

   For example, [[=a=]] stands for all the kinds of 'a' in the locale's
   character equivalents, in a single character position.

   In the absence of accented characters, [[=a=]] is the same as [a]
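
   The fallback case can be checked with a small sketch. It assumes a grep
   that accepts the [[=a=]] notation (GNU grep does; some minimal
   implementations reject it) and forces the C locale, which defines no
   accented equivalents.

```shell
# In the C locale there are no other kinds of 'a', so [[=a=]] and [a]
# should select exactly the same lines.
equiv=$(printf 'a\nb\nA\n' | LC_ALL=C grep '[[=a=]]')
plain=$(printf 'a\nb\nA\n' | LC_ALL=C grep '[a]')

echo "with [[=a=]]: $equiv"
echo "with [a]:     $plain"
```

   In a locale that does define equivalents for a, the first command could
   also report lines containing an accented a.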

5. POSIX Bracket Expression Collating Sequences
   A locale can have a collating sequence that describes how certain
   characters or sets of characters should be treated for sorting purposes.
   When a collating sequence maps a set of characters to a single logical
   character, that set counts as 'one character' for regular expression
   purposes. This means that both the special character . and the regular
   expression [^123] would match this single logical character.

   A collating sequence element can be included within a bracket expression
   using a . instead of a : or = as the delimiter. For example, in a locale
   that defines a Spanish element named span-ll, the notation
   torti[[.span-ll.]]a matches the word tortilla, as does torti.a
   The element span-ll stands for the Spanish 'll', which is treated as one
   letter and sorts between the l and m of the English alphabet.

   A collating sequence lets you match characters that are made up of
   combinations of other characters. It also creates a situation where a
   bracket expression can match more than one physical character.

   Another example is a German collating element, written [.eszet.], for
   the German letter that looks like a script upper case B; it sorts
   between S and T in the German alphabet.
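
   Element names such as span-ll exist only in locales that define them, so
   they cannot be relied on from one machine to the next. The sketch below
   shows the same [. .] notation with a single-character element, the
   hyphen, instead. It assumes a grep that accepts the notation (GNU grep
   does; some minimal implementations reject it), and the sample lines are
   made up for illustration.

```shell
# [[.-.]] names the hyphen as a collating element, so it is taken
# literally instead of being read as part of a range such as a-z.
hyphen=$(printf 'a-b\nab\na.b\n' | LC_ALL=C grep 'a[[.-.]]b')

# For comparison, . matches any single character in that position.
anychar=$(printf 'a-b\nab\na.b\n' | LC_ALL=C grep 'a.b' | paste -s -d ' ' -)

echo "with [[.-.]]: $hyphen"
echo "with .:       $anychar"
```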
Questions? Robert Katz: rkatz@ned.highline.edu
Last Update July 23, 2002