Regular Expressions Tutorial

From Free Knowledge Base- The DUCK Project: information for everyone

RegEx

Creation Date: Thu Apr 22 12:43:36 CDT 2004 current ver 0.11


REGULAR EXPRESSIONS- notes collection and general reference including examples


Regular Expressions, also known as RegEx, can save you time and money!

This reference applies to vi/vim and grep/egrep for the most part. It is useful to be familiar with some basic vi conventions. Use CTRL-V to make vi accept ASCII control characters such as a carriage return. For example, to add a CR in an html file before every table row <tr> tag, you would type:

 :%s/<tr>/{CTRL-V}{CR}<tr>/

The {} brackets enclose key combinations: hold down the Control key and press v, then release Control and press the Enter key. On the terminal screen, what you typed would appear as:

 :%s/<tr>/^M<tr>/


Single-character matching is the principle on which vi pattern matching operates. To match every 'w' in this document, but only where the 'w' is the first character in a line of text, type:

 /^w

The slash is the vi search command (refer to the vi command reference); the caret ^ is the regular expression character that indicates the beginning of the line.

 ^  Match the beginning of a line
 $  Match the end of a line

Typing /^useful causes vi to match any occurrence of the string 'useful' only when it is at the beginning of a line. The pattern is a sequence of single characters, or single-character patterns.
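The anchors can also be tried outside vi. The following is a minimal sketch using Python's re module rather than vi itself; the sample text is made up for illustration. Python needs the re.MULTILINE flag before ^ and $ apply to every line, which vi does by default.

```python
import re

# Hypothetical sample text, three lines long.
text = "we like vi\nvi is useful\nuseful tools win"

# re.MULTILINE makes ^ and $ match at each line boundary, as in vi.
starts_with_w = re.findall(r"^w", text, flags=re.MULTILINE)
starts_with_useful = re.findall(r"^useful", text, flags=re.MULTILINE)
ends_with_win = re.findall(r"win$", text, flags=re.MULTILINE)

print(starts_with_w)       # only the 'w' that begins line one
print(starts_with_useful)  # 'useful' on line two is not at the start, so it is skipped
print(ends_with_win)
```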

It is possible to group characters in a set. [ and ] delimit a set pattern with a list of characters inside. For example, /^[abc] will match any occurrence of the letter 'a', 'b', or 'c' at the beginning of a line. /^[abc][abc] tells vi to match any two characters that are each a, b, or c, starting at the beginning of a line (such as the 'ac' in 'accept').

Ranges are also possible. To match any lowercase letter at the beginning of a line of text, type /^[a-z] To match any digit anywhere in the document, type /[0-9] or, to match any alphabetic character, upper or lowercase, at the beginning of a line, type /^[a-zA-Z]

[abc]    Is a single-character pattern that matches
         either the letter a, b or c
[ab0-9]  Is a single-character pattern that matches
         either a or b or a digit in the ascii range
         from zero to nine
[a-zA-Z0-9\-] This matches a single-character that
              is either an upper case or lower case
              letter, a digit or the minus sign.
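Sets and ranges behave the same way in most regular expression engines. Here is a small sketch in Python's re module (not vi); the list of sample lines is an assumption made up for the example.

```python
import re

# Hypothetical sample lines.
lines = ["accept", "banana", "dog", "7 days", "Zebra"]

# ^[abc][abc]: the first two characters are each one of a, b, c.
two_from_set = [s for s in lines if re.search(r"^[abc][abc]", s)]
# ^[a-z]: the line starts with a lowercase letter.
lower_start = [s for s in lines if re.search(r"^[a-z]", s)]
# [0-9]: a digit anywhere in the line.
has_digit = [s for s in lines if re.search(r"[0-9]", s)]

print(two_from_set)
print(lower_start)
print(has_digit)
```

Note that 'banana' also matches the first pattern, because both 'b' and 'a' are members of the set.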

Inverted sets are also possible, using a set definition that starts with "[^" instead of "[". Placing the ^ inside the brackets changes its meaning from beginning of the line to an inverted set.

[0-9]    Is a single character pattern that matches
         a digit in the ascii range from zero to nine.
[^0-9]   Match any single NON-digit character.
[^abc]   Match any single character that is not an
         a, b or c.
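As a quick illustration of an inverted set, this sketch uses Python's re module (an assumption; the tutorial itself is about vi) to collect every non-digit character from a made-up file name.

```python
import re

# Hypothetical file name used only for this example.
sample = "gm9283900.jpg"

# [^0-9] matches each single character that is not a digit.
non_digits = re.findall(r"[^0-9]", sample)
# [0-9] matches each single digit.
digits = re.findall(r"[0-9]", sample)

print("".join(non_digits))
print("".join(digits))
```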

There are special characters such as the '.' dot wildcard and the '*' multiplier. You may be accustomed to using the * asterisk as a wildcard, but this is _not_ the case in regular expressions.

.  matches one occurrence of anything except a newline character
*  multiplier: the preceding single-character pattern may occur zero or more times
-  indicates a range (inside a [ ] set)

Special characters can be expressed literally by using a backslash. Preceding a special character with a backslash, such as \. will cause the '.' to be taken as its literal meaning and not as its reserved function characteristic.
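The difference between a special character and its escaped, literal form can be seen in a short sketch. This uses Python's re module as a stand-in for vi; the sample strings are invented for the example.

```python
import re

# Hypothetical sample text containing three similar words.
words = "abc a.c a-c"

# Unescaped dot: any single character may sit between 'a' and 'c'.
any_char = re.findall(r"a.c", words)
# Escaped dot: only a literal '.' may sit between 'a' and 'c'.
literal_dot = re.findall(r"a\.c", words)

# * lets the preceding 'b' occur zero or more times.
zero_b = re.fullmatch(r"ab*c", "ac") is not None
many_b = re.fullmatch(r"ab*c", "abbbc") is not None

print(any_char)
print(literal_dot)
print(zero_b, many_b)
```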

To match lines that have a space as the second character (any character in the first position, then a space), type in vi:

 /^.\ 

To do the same for lines with a digit as the third character, type:

 /^..[0-9]

Match lines that start with anything other than 'a':

 /^[^a]
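The three positional searches above can be checked against a few sample lines. This is a sketch in Python's re module, not vi; the sample lines are hypothetical.

```python
import re

# Hypothetical sample lines.
lines = ["a b c", "ab1 cd", "xy7z", "banana"]

# ^. followed by a space: any first character, then a space in position two.
space_second = [s for s in lines if re.search(r"^. ", s)]
# ^..[0-9]: any two characters, then a digit in position three.
digit_third = [s for s in lines if re.search(r"^..[0-9]", s)]
# ^[^a]: the first character is anything other than 'a'.
not_a_first = [s for s in lines if re.search(r"^[^a]", s)]

print(space_second)
print(digit_third)
print(not_a_first)
```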

Now, to get multipliers involved, let's take a look at some matches where anything can be in the middle. Match a string that begins a line with 'a', has any number of any characters in the middle, and ends with the last 'the' in the line:

 /^a.*the

Notice that the match does not stop at the first occurrence of the word 'the'; it continues to the very last occurrence of the string 'the' on the line. A multiplier will basically swallow up everything until the last possible match.
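To see how far the multiplier reaches, here is a small sketch in Python's re module (assumed here as a stand-in for vi; the sentence is made up). The match runs from the leading 'a' to the last 'the' on the line, not the first.

```python
import re

# Hypothetical line containing 'the' twice.
line = "a cat saw the dog chase the bird home"

# .* is greedy: it consumes as much as possible while still allowing 'the' to match.
match = re.search(r"^a.*the", line)
matched_text = match.group(0)

print(matched_text)
```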

For more complicated search and replace operations, it becomes necessary to stuff some of the text string or a single character into memory. Parentheses are a memory construct in regular expressions. What is enclosed in them is remembered and used later on. In the vi/vim editor the parentheses syntax must include backslashes.

Memory constructs are not that useful for simple searches such as those we have demonstrated above. They are, however, absolutely necessary for complex search and replace operations.

In an html document I have several images. I need to change the extension of every image that represents an indexed part from 'jpg' to 'gif', and not alter any other image names. It seems that indexed parts in our example always start with the string 'gm' and are followed by a sequence of numbers of variable length, and then concluded with the '.jpg' extension. Example:

 <img src="gm9283900.jpg">
 <img src="gm66001.jpg">

Since we do not wish to modify any other image tags in the html document, we must be careful how we construct our regular expression for pattern matching.

 :%s/\(<img src="gm[0-9]*\.\)jpg/\1gif/

Now we will break down each component of the regular expression to understand better how it works as a whole.

The syntax for a search and replace in vi/vim is:

 :%s/<string1>/<string2>

We want to store part of the string in memory: we do not wish to modify that part, but we include it in the pattern so that the match is specific enough to avoid unwanted matches.

 \(

Backslash and parenthesis tell vi where to start storing the following set of characters in memory

 <img src="gm[0-9]*\.

which includes the first portion of an html image tag and the constant 'gm' that is in all the images we want to modify, followed by [0-9] to match any digit, * (multiplier) as many times as needed, until \. a literal period.

 \)jpg

Stop storing in memory here so that 'jpg' is excluded from the stored string. We want to throw away 'jpg'.

 \1gif

The \1 recalls everything stored in memory, which is the exact pattern matched in the first part of the vi statement, and appends onto the end the 'gif' extension.
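The same substitution can be sketched in Python's re module. Two differences from vi are worth noting: Python writes the memory parentheses without backslashes, and re.sub replaces every match by default. The html string below, including the 'logo.jpg' image, is a made-up sample.

```python
import re

# Hypothetical html fragment: two indexed parts and one unrelated image.
html = '<img src="gm9283900.jpg"> <img src="logo.jpg"> <img src="gm66001.jpg">'

# Group 1 remembers everything up to and including the period; 'jpg' is left outside.
pattern = r'(<img src="gm[0-9]*\.)jpg'

# \1 recalls the remembered text, and 'gif' is appended after it.
result = re.sub(pattern, r"\1gif", html)

print(result)
```

Only the two 'gm' images are renamed; 'logo.jpg' is left alone because it does not fit the remembered pattern.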

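It is possible to remove all leading spaces from a text file in vi using regular expressions. Type:

 :%s/^  *//

Notice that it was necessary to have two spaces after the ^ caret. The first space is a literal space that must be present; the second space together with the * matches zero or more additional spaces. The % range applies the command to every line; without it, :s changes only the current line.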
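A rough equivalent in Python's re module (an assumption; vi is the tool the text describes) shows the same pattern, one literal space followed by space-star, applied to every line of a made-up sample.

```python
import re

# Hypothetical text with differently indented lines.
text = "   indented\nflush\n  two spaces"

# '^  *' is a caret, one required space, then zero or more further spaces.
# re.MULTILINE makes ^ apply at the start of each line.
stripped = re.sub(r"^  *", "", text, flags=re.MULTILINE)

print(stripped)
```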

Some more examples:

I have an html file where all the sourcecode is on a single line. I want to create line breaks {CR} after each <tr> table row starts and concludes.

 :%s/<tr>/<tr>{CTRL-V}{CR}/g
 :%s/<\/tr>/<\/tr>{CTRL-V}{CR}/g

By default, vi/vim will replace only the first match on each line. The 'g' at the end tells vi to replace every match on the line. The percent sign is the ex range for 'whole file' and can be replaced by any appropriate range. E.g. in vim you can type shift-v, mark an area, and then use the substitution on that area only. I don't explain more about vim here as this would be a tutorial on its own. The form :%s/ / /gc is the interactive version, which asks for confirmation before each replacement. The non-interactive form is :%s/ / /g

Now to create a line break {CR} for each <td> table cell tag:

 :%s/<td>/<td>{CTRL-V}{CR}/g
 :%s/<\/td>/<\/td>{CTRL-V}{CR}/g
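The effect of these four substitutions can be sketched in Python's re module (not vi). The one-line html sample is hypothetical, and "\n" plays the role of the {CR} typed in vi.

```python
import re

# Hypothetical html source squeezed onto a single line.
html = "<table><tr><td>1</td></tr></table>"

broken = html
for tag in ("<tr>", "</tr>", "<td>", "</td>"):
    # re.escape guards against any special characters in the tag;
    # the replacement is the tag itself followed by a line break.
    broken = re.sub(re.escape(tag), tag + "\n", broken)

print(broken)
```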

To add some indentation in the html source code:

 :%s/<tr>/  <tr>/g
 :%s/<\/tr>/  <\/tr>/g
 :%s/<\/td>/    <\/td>/g
 :%s/<td>/    <td>/g

Change all tags that reference some old graphics directory to use our /images subdirectory:

 :%s/<img src=.*graphics/<img src=images/

To use the /images directory on the local server as opposed to a remote server in your html document:

 :%s/img src="http:\/\/www\.geoshitties\.com\/blockhead/img src="images/

Add a file extension of .gif on a bunch of images that start with a specific string of text and do not currently have a file extension:

 :%s/\(moparPN_[0-9]*_[0-9]*\)/\1.gif/

Starting at line 6 we want to add a two space indent to every line in a text file:

 :6,$s/\(.*\)/\ \ \1/
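The line range and the remembered text can be imitated in Python's re module. This is only a sketch: the list of lines is made up, a list slice stands in for the 6,$ range, and count=1 mirrors vi's default of one replacement per line.

```python
import re

# Hypothetical file contents: eight numbered lines.
lines = [f"line {n}" for n in range(1, 9)]

# Lines 1-5 are untouched; from line 6 on, the whole line is remembered
# as group 1 and written back with two spaces in front.
indented = lines[:5] + [re.sub(r"(.*)", r"  \1", s, count=1) for s in lines[5:]]

print("\n".join(indented))
```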

Preventing RegEx Greediness: * will match as many characters as possible. Usually you want to be greedy, but not always. In Perl-style regular expressions you prevent greediness by adding a question mark after the * (as in .*?). vi/vim does not understand that syntax; in vim the non-greedy multiplier is written \{-}

For example, if you want to strip HTML tags, the following won't work the way you probably want:

  :%s/<.*>//g

This matches TOO much! It sees "<html>...</html>" as one big /<.*>/ (starts with "<" and ends with ">").

  :%s/<.\{-}>//g

solves the greediness problem in vim by using the shortest possible match (the Perl-style equivalent is <.*?>).

  :%s/<[^>]*>//g

also works, and uses only syntax that vi, vim and grep all share.
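The three variants can be compared side by side in Python's re module, which does support the question-mark form. The html string is a hypothetical sample, and this is a sketch of the behaviour rather than a vi session.

```python
import re

# Hypothetical html fragment.
html = "<html><b>bold</b> text</html>"

# Greedy: one match from the first '<' to the last '>', so everything is removed.
greedy = re.sub(r"<.*>", "", html)
# Non-greedy: each match stops at the nearest '>', so only the tags are removed.
lazy = re.sub(r"<.*?>", "", html)
# Negated set: '>' cannot be swallowed, which gives the same result.
negated = re.sub(r"<[^>]*>", "", html)

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```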