Regular Expressions Tutorial

From Free Knowledge Base- The DUCK Project: information for everyone
* For more Regular Expressions see the RegEx and Vim Cookbook.
* See also Vi Short Command Reference
* See also Vim Tips
* See also Text Pattern RegEx in Perl
* See also uniq

Latest revision as of 17:16, 10 February 2022

RegEx

Creation Date: Thu Apr 22 12:43:36 CDT 2004 current ver 0.11


REGULAR EXPRESSIONS- notes collection and general reference including examples


Regular Expressions, also known as RegEx, can save you time and money!

This reference applies to vi/vim and grep/egrep for the most part. It is useful to be familiar with some basic vi conventions. Use CNTRL-V to make vi accept ASCII control characters such as carriage return. For example, to add a CR before every table row <tr> tag in an html file you would type :%s/<tr>/{CNTRL-V}{CR}<tr>/ where the {} brackets enclose key combinations: hold down the control key and press v, then release control and press the Enter key. Within vi on the terminal screen what you typed would appear as :%s/<tr>/^M<tr>/


Single character matching is the principle on which vi operates. To match every 'w' in this document only if the 'w' is the first character in a line of text, type:

 /^w

The slash is the vi search character (refer to the vi command reference); the caret ^ is the part of the regular expression that indicates the beginning of the line.

 ^  Match the beginning of a line
 $  Match the end of a line

Typing /^useful causes vi to match any occurrence of the string 'useful' only when it is at the beginning of a line. The pattern is a sequence of single characters, or single-character patterns.
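
The anchors can also be tried outside vi. The following is a minimal sketch using Python's re module rather than vi itself; the sample text is made up for illustration, and re.MULTILINE is needed so that ^ and $ apply to every line the way they do in vi.

```python
import re

# Hypothetical sample text, three lines long.
text = "useful tools\nnot so useful\nuseful again"

# re.MULTILINE makes ^ and $ match at the start and end of each line.
starts = re.findall(r"^useful", text, flags=re.MULTILINE)
ends = re.findall(r"useful$", text, flags=re.MULTILINE)

print(len(starts))  # lines that begin with 'useful'
print(len(ends))    # lines that end with 'useful'
```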

It is possible to group characters in a set. [ and ] represent a group pattern with a list of characters inside. For example, /^[abc] will match any occurrence of the letter 'a', 'b', or 'c' individually and at the beginning of a line. /^[abc][abc] tells vi to match any two characters that each individually are a, b, or c, starting at the beginning of a line (such as the 'ac' in 'accept').

Ranges are also possible. To match any lowercase letter at the beginning of a line of text type /^[a-z] To match any digit anywhere in the document type /[0-9] or to match any alphabetic character, upper or lowercase, at the beginning of a line type /^[a-zA-Z]

[abc]    Is a single-character pattern that matches
         either the letter a, b or c
[ab0-9]  Is a single-character pattern that matches
         either a or b or a digit in the ascii range
         from zero to nine
[a-zA-Z0-9\-] This matches a single-character that
              is either an upper case or lower case
              letter, a digit or the minus sign.
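
As a rough check of the three sets in the table, here is a small sketch in Python's re module (not vi); the sample string is an assumption chosen so that each set picks out different characters.

```python
import re

# Hypothetical sample containing letters, a hyphen, a digit and punctuation.
sample = "ab-9 Zq_!"

set_abc = re.findall(r"[abc]", sample)                # a, b or c
set_ab_digits = re.findall(r"[ab0-9]", sample)        # a, b or a digit
word_like = re.findall(r"[a-zA-Z0-9\-]", sample)      # letter, digit or minus

print(set_abc)
print(set_ab_digits)
print(word_like)
```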

Inverted sets are also possible using a set definition with "[^" instead of "[". Placing the ^ immediately after the opening bracket changes its meaning from 'beginning of the line' to 'inverted set'.

[0-9]    Is a single character pattern that matches
         a digit in the ascii range from zero to nine.
[^0-9]   Match any single NON-digit character.
[^abc]   Match any single character that is not an
         a, b or c.
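
The inverted sets behave the same way in Python's re module, so a short sketch there (with a made-up sample string) shows which characters each one keeps.

```python
import re

# Hypothetical sample mixing letters and digits.
sample = "a1b2c3d"

non_digits = re.findall(r"[^0-9]", sample)  # everything that is not a digit
not_abc = re.findall(r"[^abc]", sample)     # everything that is not a, b or c

print(non_digits)
print(not_abc)
```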

There are special characters such as the '.' dot wildcard and the '*' multiplier. You may be accustomed to using the * asterisk as a wildcard, but this is _not_ the case in regular expressions.

.  matches one occurrence of anything except a new line character
*  multiplier determines how often a single-character pattern must occur 
-  indicates a range

Special characters can be expressed literally by using a backslash. Preceding a special character with a backslash, such as \. will cause the '.' to be taken as its literal meaning and not as its reserved function characteristic.
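
To see the difference between the dot as a wildcard and the escaped dot, here is a minimal sketch in Python's re module; the three candidate strings are invented for the example.

```python
import re

# Hypothetical candidates: only the first contains a real period.
names = ["a.c", "abc", "a-c"]

# Unescaped dot: any single character is accepted in the middle.
wildcard = [n for n in names if re.fullmatch(r"a.c", n)]

# Escaped dot: only a literal period is accepted in the middle.
literal = [n for n in names if re.fullmatch(r"a\.c", n)]

print(wildcard)
print(literal)
```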

To match lines that have a space as the second character (any character in the first position, then a space in the second), in vi you simply type

 /^.\ 

To do the same for lines with any number as the third character type:

 /^..[0-9]

Match lines that start with anything other than 'a':

 /^[^a]
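
The three positional searches above can be sketched in Python's re module as line filters. The list of lines is an assumption for illustration, and in Python the space needs no backslash.

```python
import re

# Hypothetical lines to filter.
lines = ["a b", "xy7z", "apple", "b 2"]

second_is_space = [l for l in lines if re.search(r"^. ", l)]
third_is_digit = [l for l in lines if re.search(r"^..[0-9]", l)]
not_starting_with_a = [l for l in lines if re.search(r"^[^a]", l)]

print(second_is_space)
print(third_is_digit)
print(not_starting_with_a)
```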

Now to get multipliers involved, let's take a look at some matches where anything can be in the middle. Match any line beginning with 'a', followed by any number of any characters, up to the last occurrence of 'the' in the line:

 /^a.*the

Notice how the first occurrence of the word 'the' is passed over and the match continues to the very last occurrence of the string 'the'? A multiplier will basically swallow up everything until the last possible match.
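
The swallowing behaviour is easy to observe with a line that contains 'the' twice. This is a sketch in Python's re module with a made-up sentence; the pattern is the same one typed into vi above.

```python
import re

# Hypothetical line containing the word 'the' twice.
line = "a cat saw the dog chase the ball home"

match = re.search(r"^a.*the", line)

# The match runs to the LAST 'the', not the first one.
print(match.group(0))
```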

For more complicated search and replace operations, it becomes necessary to stuff some of the text string or a single character into memory. Parentheses are a memory construct in regular expressions. What is enclosed in them is remembered and used later on. In the vi/vim editor the parentheses syntax must include backslashes.

Memory constructs are not that useful for simple searches such as those we have demonstrated above. They are, however, absolutely necessary for complex search and replace operations.

In an html document I have several images. I need to change the extension of every image that represents an indexed part from 'jpg' to 'gif', and not alter any other image names. It seems that indexed parts in our example always start with the string 'gm' and are followed by a sequence of numbers of variable length, and then concluded with the '.jpg' extension. Example:

 <img src="gm9283900.jpg">
 <img src="gm66001.jpg">

Since we do not wish to modify any other image tags in the html document, we must be careful how we construct our regular expression for pattern matching.

 :%s/\(<img src="gm[0-9]*\.\)jpg/\1gif/

Now we will break down each component of the regular expression to understand better how it works as a whole.

The syntax for a search and replace in vi/vim is:

 :%s/<string1>/<string2>

We want to stuff part of the string in memory: we do not wish to modify that particular part, but we do need it in the pattern so that the match is specific enough to avoid unwanted matches.

\(

Backslash and parenthesis tell vi where to start storing the following set of characters in memory

<img src="gm[0-9]*\.

Which includes the first portion of an html image tag and the constant 'gm' that is in all the images we want to modify, followed by [0-9] to match any digit, * (multiplier) to repeat it as many times as needed, and \. for a literal period.

 \)jpg

Stop storing in memory here so that 'jpg' is excluded from the memory stored string. We want to throw away 'jpg'.

 \1gif

The \1 recalls everything stored in memory, which is the exact pattern matched in the first part of the vi statement, and appends onto the end the 'gif' extension.
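
The same substitution can be sketched with re.sub in Python. Two things here are assumptions and not part of the vi example: the sample html, which adds a 'logo.jpg' tag to show what is left alone, and the Python syntax itself, where the memory parentheses are written without backslashes.

```python
import re

# Hypothetical html: two indexed parts and one unrelated image.
html = (
    '<img src="gm9283900.jpg">\n'
    '<img src="logo.jpg">\n'
    '<img src="gm66001.jpg">'
)

# ( ... ) stores the tag up to and including the period; \1 recalls it.
result = re.sub(r'(<img src="gm[0-9]*\.)jpg', r"\1gif", html)

print(result)
```

Only the tags whose file name starts with 'gm' followed by digits are changed; the unrelated image keeps its 'jpg' extension.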

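It is possible to remove all leading spaces from a text file in vi using regular expressions. Type:

 :%s/^  *//

Notice that it was necessary to have two spaces after the ^ caret: the first space must be present, and the second space followed by * matches any further spaces. The % applies the command to every line; without it only the current line is changed.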
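
A minimal sketch of the same idea in Python's re module, assuming a made-up three-line text. re.MULTILINE plays the role of applying the command to every line.

```python
import re

# Hypothetical text: two indented lines and one that is not indented.
text = "  two\n    four\nnone"

# One literal space, then zero or more further spaces, at each line start.
stripped = re.sub(r"^  *", "", text, flags=re.MULTILINE)

print(stripped)
```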

Some more examples:

I have an html file where all the source code is on a single line. I want to create line breaks {CR} after each <tr> table row starts and concludes.

 :%s/<tr>/<tr>{CNTRL-V}{CR}/g
 :%s/<\/tr>/<\/tr>{CNTRL-V}{CR}/g

By default, vi/vim will match only once per line. The 'g' at the end tells vi to match multiple times per line. The percent refers to the ex range 'whole file' and can be replaced by any appropriate range. E.g. in vim you type shift-v, mark an area and then use the substitution on that area only. I don't explain more about vim here as this would be a tutorial on its own. The interactive version is :%s/ / /gc, which asks for confirmation before each replacement; the non-interactive version is :%s/ / /g
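
The effect of the 'g' flag can be imitated in Python's re module, with the caveat that the defaults are reversed: re.sub replaces every match unless count is given. The one-line html below is an assumption for illustration.

```python
import re

# Hypothetical html with everything on a single line.
row = "<tr><td>1</td></tr><tr><td>2</td></tr>"

# Like :s/<tr>/.../g  (every match on the line is replaced).
all_rows = re.sub(r"<tr>", "<tr>\n", row)

# Like :s/<tr>/.../   (only the first match on the line is replaced).
first_only = re.sub(r"<tr>", "<tr>\n", row, count=1)

print(all_rows.count("\n"))
print(first_only.count("\n"))
```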

Now to create a line break {CR} for each <td> table cell tag:

 :%s/<td>/<td>{CNTRL-V}{CR}/g
 :%s/<\/td>/<\/td>{CNTRL-V}{CR}/g

To add some indentation in the html source code:

 :%s/<tr>/  <tr>/g
 :%s/<\/tr>/  <\/tr>/g
 :%s/<\/td>/    <\/td>/g
 :%s/<td>/    <td>/g

Change all tags that reference some old graphics directory to use our /images subdirectory:

 :%s/<img src=.*graphics/<img src=images/

To use the /images directory on the local server as opposed to a remote server in your html document:

 :%s/img src="http:\/\/www\.geoshitties\.com\/blockhead/img src="images/

Add a file extension of .gif on a bunch of images that start with a specific string of text and do not currently have a file extension:

 :%s/\(moparPN_[0-9]*_[0-9]*\)/\1.gif/
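
Here is a sketch of the same append-an-extension substitution in Python's re module. The two file names are invented to fit the moparPN pattern, and the parentheses are unescaped because that is Python's syntax.

```python
import re

# Hypothetical names that follow the moparPN_<digits>_<digits> scheme.
text = "moparPN_123_45 and moparPN_9_001"

# Store the whole name in memory, then write it back with '.gif' appended.
with_ext = re.sub(r"(moparPN_[0-9]*_[0-9]*)", r"\1.gif", text)

print(with_ext)
```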

Starting at line 6 we want to add a two space indent to every line in a text file:

 :6,$s/\(.*\)/\ \ \1/
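
The ex range 6,$ has no direct counterpart in Python's re module, so this sketch uses a list slice for the range and re.sub for the indent. The eight sample lines are assumptions. count=1 is passed because Python would otherwise also replace the empty match at the end of each line.

```python
import re

# Hypothetical eight-line file, one list entry per line.
lines = [f"line {n}" for n in range(1, 9)]

# Lines 1-5 stay as they are; from line 6 on, prepend two spaces.
indented = lines[:5] + [
    re.sub(r"(.*)", r"  \1", line, count=1) for line in lines[5:]
]

print("\n".join(indented))
```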

Preventing RegEx Greediness: * will match as many characters as possible. Usually you want to be greedy, but not always. In Perl-style regular expressions you make * non-greedy by adding a question mark after it (*?). Vim does not understand *? and uses \{-} for the shortest possible match instead.

For example, if you want to strip HTML tags the following won't work the way you probably want.

  :%s/<.*>//g        # Matches TOO much! It sees "<html>...</html>"
                       as one big /<.*>/ (starts with "<" and ends with ">").
  :%s/<.\{-}>//g     # Solves the greediness problem by using the shortest
                       possible match.
  :%s/<[^>]*>//g     # Also works.
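
The three variants can be compared side by side in Python's re module, which uses the Perl-style *? for the shortest match. The html string is a made-up example; this is a sketch of the behaviour, not of the Vim commands themselves.

```python
import re

# Hypothetical html on a single line.
html = "<html><b>bold</b> text</html>"

greedy = re.sub(r"<.*>", "", html)       # first "<" to the last ">"
lazy = re.sub(r"<.*?>", "", html)        # shortest match, one tag at a time
negated = re.sub(r"<[^>]*>", "", html)   # "<", then anything except ">"

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```

The greedy version removes the whole line, while the other two remove only the tags and leave the text between them.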