This article has been
excerpted from the book "The Complete Visual C# Programmer's Guide" by the authors
of C# Corner.
A regular expression is a string of characters that contains a pattern to find
the string or strings you are looking for. In its simplest form, a regular
expression is just a word or phrase to search for in the
source string. Regular expressions can also include metacharacters, which are special
characters that add a great deal of flexibility and convenience to the search.
Regular expressions have their origins in automata theory and formal language
theory, which study models of computation (automata) and ways to describe and
classify formal languages. In theoretical computer science, a formal language is
simply a set of strings.
In the 1940s, neurophysiologist Warren McCulloch and logician Walter Pitts
described the nervous system by modeling neurons. Later, mathematician Stephen
Kleene described these models using a mathematical notation he called regular
sets, and developed regular expressions as a notation for describing them.
Afterward, Ken Thompson, one of the key creators of the Unix operating system,
built regular expressions into text tools such as qed (a predecessor of the Unix
ed) and grep. Since then, regular expressions have been widely used on both
Windows and Unix.
Patterns
Let's examine two regular expression patterns:
Pattern #1: Regex objNotNaturalPattern = new Regex("[^0-9]");
Pattern #2: Regex objNaturalPattern = new Regex("0*[1-9][0-9]*");
Pattern #1 matches any single character that is not a digit from 0 to 9 (inside
brackets, a leading ^ means "not"). Use brackets to give lists or ranges of
characters, such as 0-9, a-z, or A-Z. IsMatch therefore returns true for any
string that contains at least one non-digit character. For example, the string
abc returns true for pattern #1, but the string 123 returns false.
Pattern #2 matches a natural number (a whole number greater than 0). The pattern
0* indicates that the number can be prefixed with any number of zeroes or none
at all. The next pattern, [1-9], means that exactly one digit between 1 and 9
(inclusive) must follow. The last pattern, [0-9]*, indicates that any number of
digits between 0 and 9 can follow. For example, 0007 returns true, whereas 00
returns false. Because neither pattern is anchored to the start or end of the
string, pattern #2 alone would also accept a string such as a7; a string is a
natural number only if pattern #2 matches and pattern #1 does not.
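As a rough sketch of how the two patterns behave with Regex.IsMatch, the
following program tries them on a few inputs. The class name PatternDemo, the
helper method names, and the sample inputs are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Pattern #1: true if the input contains at least one non-digit character.
    public static bool HasNonDigit(string input)
    {
        return Regex.IsMatch(input, "[^0-9]");
    }

    // Pattern #2: true if the input contains a digit from 1 to 9 somewhere.
    public static bool HasNonZeroDigit(string input)
    {
        return Regex.IsMatch(input, "0*[1-9][0-9]*");
    }

    public static void Main()
    {
        foreach (string s in new[] { "abc", "123", "0007", "00" })
        {
            Console.WriteLine("{0}: pattern #1 = {1}, pattern #2 = {2}",
                s, HasNonDigit(s), HasNonZeroDigit(s));
        }
    }
}
```

A natural-number test combines both results: pattern #2 must match and
pattern #1 must not.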
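Here are the basic metacharacters used in these patterns:
- * = zero or more of the preceding character
- ? = zero or one of the preceding character
- ^ = not (when it is the first character inside brackets)
- [] = a list or range of characters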
Metacharacters
The period is a widely used wildcard metacharacter. It matches exactly one
character and does not care what the character is. For example, the regular
expression 5,.-AnyString will match 5,8-AnyString and 5,9-AnyString. The period
will not match a string of characters or the null string. Thus, 5,800-AnyString
and 5,-AnyString will not be matched by the regular expression above.
What if you want to search for a string containing a period? For example, we may
wish to search for references to the mathematical constant pi. Bear in mind that
the regular expression 3.14 would match 3.14, 3914, 3g14, and even 3*14.
We can get around this with a second metacharacter, the backslash, which
indicates that the character following it must be taken as a literal character.
If we want to search for the string 3.14, we would use 3\.14. In regular
expression terminology, this operation is called quoting, and the period in the
regular expression is said to be quoted.
Be careful when using the backslash to quote, since it has another function in
escape sequences such as \n, \r, \t, \f, \b, \B, \0, and \1 through \9. These
sequences do not search for the literal letter or digit after the backslash;
each has a special meaning of its own. Quoting a metacharacter (e.g., \.) turns
it into a normal character, but quoting a normal character (e.g., \9) may turn
it into a metacharacter.
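The difference between a quoted and an unquoted period can be sketched as
follows. The class and method names are hypothetical. The patterns are written
as C# verbatim strings (prefixed with @) so that the compiler passes the
backslash through to the regular expression unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotingDemo
{
    // Unquoted period: any single character may sit between 3 and 14.
    public static bool MatchesLoosely(string input)
    {
        return Regex.IsMatch(input, @"3.14");
    }

    // Quoted period: only a literal '.' may sit between 3 and 14.
    public static bool MatchesLiterally(string input)
    {
        return Regex.IsMatch(input, @"3\.14");
    }

    public static void Main()
    {
        foreach (string s in new[] { "3.14", "3914", "3g14" })
        {
            Console.WriteLine("{0}: unquoted = {1}, quoted = {2}",
                s, MatchesLoosely(s), MatchesLiterally(s));
        }

        // Regex.Escape quotes the metacharacters in a literal search string.
        Console.WriteLine(Regex.Escape("3.14"));
    }
}
```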
The question mark indicates that the character immediately preceding it appears
either zero or one time. For example, A?nyString would match either nyString
or AnyString; Another?String would match either AnotheString or AnotherString.
The star, or asterisk, indicates that the character to its left can be repeated
zero or any number of times. For example, XY*Z would match XZ, XYZ, XYYZ, XYYYZ,
or XYYYYYYYYZ. In other words, any string is satisfactory if it starts with an
X, is followed by a sequence of any number of Y characters, and ends with a Z.
The plus metacharacter is just like the star metacharacter except that it
doesn't match the null string. For example, XY+Z would not match XZ but would
match XYZ, XYYZ, XYYYZ, or XYYYYYYYYZ.
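The three repetition metacharacters can be compared side by side with a short
program. This is only a sketch: the class name, the MatchesWhole helper, and
the ^(?:...)$ wrapper that forces the whole input to match are assumptions
added for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // True only if the pattern matches the whole input, not just part of it.
    public static bool MatchesWhole(string pattern, string input)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        string[] patterns = { "XY?Z", "XY*Z", "XY+Z" };
        string[] inputs = { "XZ", "XYZ", "XYYYZ" };

        foreach (string pattern in patterns)
        {
            foreach (string input in inputs)
            {
                Console.WriteLine("{0} on {1}: {2}",
                    pattern, input, MatchesWhole(pattern, input));
            }
        }
    }
}
```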
Many metacharacters can be combined. A practical combination is the period
followed by the star, which matches a string of any length, even the null
string. For example, AnyString.*ade would match AnyStringDecade, AnyStringFacade,
AnyString of steel made, and even AnyStringade. It matches any string starting
with AnyString, followed by any string or the null string, and ending with ade.
If you want to search for AnyStringDecade and AnyStringFacade but do not want to
match AnyString of steel made, you could string together three periods:
AnyString...ade. Only strings 15 characters long which start with AnyString and
end with ade will be matched.
Now, with x\.*z you will match any string that starts with x, is followed by a
series of periods or no period, and ends with z; for example, xz, x.z, x..z, or
x...z.
The expression x.\*z will match any string that starts with x, is followed by
one arbitrary character, and ends with *z: xf*z, x9*z, x@*z. The expression
x\++z will match any string that starts with x, is followed by one or a series
of plus signs, and is terminated by z. Thus, xz is not matched, but x+z, x++z,
and x+++z are.
The expression b.?t will match but, bat, bot, and any other three-character
string that begins with b and ends with t, and will also match bt. The
expression b\.?t will match only bt and b.t. The expression b.\?t will match any
four-character string that starts with b and ends with ?t: bu?t, b9?t, b@?t. The
expression b\.\?t will match only b.?t.
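A few of these combinations can be checked with a small table-driven program.
It is a sketch only: the class name, the MatchesWhole helper, and the
^(?:...)$ wrapper (which makes the pattern cover the whole input) are
assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombinedDemo
{
    // True only if the pattern matches the whole input, not just part of it.
    public static bool MatchesWhole(string pattern, string input)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        // Each row pairs a pattern with one input to try it on.
        string[,] cases =
        {
            { @"AnyString.*ade",  "AnyString of steel made" },
            { @"AnyString...ade", "AnyStringFacade" },
            { @"AnyString...ade", "AnyString of steel made" },
            { @"x\.*z",           "x..z" },
            { @"x.\*z",           "xf*z" },
            { @"x\++z",           "xz" },
            { @"b\.?t",           "bat" }
        };

        for (int i = 0; i < cases.GetLength(0); i++)
        {
            Console.WriteLine("{0} on \"{1}\": {2}",
                cases[i, 0], cases[i, 1], MatchesWhole(cases[i, 0], cases[i, 1]));
        }
    }
}
```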
We mentioned that the backslash can turn ordinary characters into metacharacters
and vice versa. One example is the digit metacharacter, \d, which will match
exactly one digit. For example, 5,\d-AnyString will match 5,5-AnyString and
5,9-AnyString. Also, 5\.\d\d\d\d will match any five-digit floating-point
number from 5.0000 to 5.9999.
We can combine the digit metacharacter with other metacharacters. For
example, x\d+z will match any string that starts with x, is followed by one or
more digits, and ends with z. Note that since the plus sign is used, the
expression will not match xz.
In the digit metacharacter, the letter d must be lowercase because the nondigit
metacharacter, \D, uses the uppercase D. The nondigit metacharacter will match
any character except a digit. For example, x\Dz will match xyz, xYz, or x@z, but
not x0z, x1z, or x9z. Most metacharacters using a backslash take the inverse
meaning with an uppercase letter.
The word metacharacter, \w, matches exactly one letter, one number, or the
underscore character. Its inverse, \W, matches any one character except a
letter, a number, or the underscore. For example, x\wz will match xyz, xYz, x9z,
x_z, or any other three-character string that starts with x, has a second
character that is either a letter, a number, or the underscore, and ends with z.
The white-space metacharacter, \s, matches exactly one character of white
space: spaces, tabs, new lines, or any other character that cannot be seen when
printed. Its opposite, \S, matches any character that is not white space. For
example, x\sz will match any three-character string that starts with x, has a
second character that is a space, tab, or new line, and ends with z. The
expression x\Sz will match any three-character string that starts with x, has a
second character that is not a space, tab, or new line, and ends with z.
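The digit, word, and white-space metacharacters and their uppercase inverses
can be tried out with a sketch like the one below. The class name, the
MatchesWhole helper, and its ^(?:...)$ wrapper are assumptions made for the
example; the sample inputs are arbitrary.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShorthandDemo
{
    // True only if the pattern matches the whole input, not just part of it.
    public static bool MatchesWhole(string pattern, string input)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        // Each row pairs a pattern with one input to try it on.
        string[,] cases =
        {
            { @"x\d+z", "x123z" },
            { @"x\d+z", "xz" },
            { @"x\Dz",  "x@z" },
            { @"x\Dz",  "x0z" },
            { @"x\wz",  "x_z" },
            { @"x\sz",  "x\tz" },   // the input contains a tab character
            { @"x\Sz",  "x z" }
        };

        for (int i = 0; i < cases.GetLength(0); i++)
        {
            Console.WriteLine("{0}: {1}",
                cases[i, 0], MatchesWhole(cases[i, 0], cases[i, 1]));
        }
    }
}
```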
The word-boundary metacharacter, \b, matches a position rather than a character:
the boundary between a word character (a letter, a number, or the underscore)
and a non-word character such as a space or punctuation, or the start or end of
the string. Its opposite, \B, matches any position that is not a word boundary.
For example, \bcommut will match in commuter or commuting, but will not match in
telecommuter since there is no space or punctuation between tele and commuter.
The expression \Bcommut will not match in a word like commuter or commuting
unless it is part of a larger word such as telecommuter or telecommuting. The
underscore is considered a "word" character. For example, \bcommuter will not
match in tele_commuter, but will match in tele commuter and tele-commuter.
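The commuter examples can be verified with a short program. This is a minimal
sketch; the class and method names are invented for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // True if "commut" begins at a word boundary somewhere in the input.
    public static bool StartsWord(string input)
    {
        return Regex.IsMatch(input, @"\bcommut");
    }

    // True if "commut" begins somewhere that is not a word boundary.
    public static bool InsideWord(string input)
    {
        return Regex.IsMatch(input, @"\Bcommut");
    }

    public static void Main()
    {
        string[] inputs = { "commuter", "telecommuter", "tele-commuter", "tele_commuter" };
        foreach (string s in inputs)
        {
            Console.WriteLine("{0}: \\b = {1}, \\B = {2}",
                s, StartsWord(s), InsideWord(s));
        }
    }
}
```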
The octal metacharacter, \nnn, where n is a number from zero to seven, is
generally used to specify control characters that have no typed equivalent. For
example, \007 will match an embedded ASCII bell character, the ASCII value of 7.
The braces metacharacter follows a normal character and contains two numbers
separated by a comma and surrounded by braces. It acts like the star
metacharacter, but the length of the string it matches must be within the
minimum and maximum length specified by the two numbers in braces. For example,
xy{3,5}z will match only xyyyz, xyyyyz, and xyyyyyz. The expression .{2,4}ade
will match cascade, facade, arcade, or decade, but not fade since f is only one
character long.
The vertical bar metacharacter indicates an either/or choice. For example,
mystery|myth|arcane will match strings with either mystery or myth or arcane or
any combination of all three.
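The braces and vertical bar examples can be sketched in code as follows. The
class and method names are hypothetical, and the ^ and $ anchors in the first
pattern are added here so that the whole input has to match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracesAndBarDemo
{
    // x, then three to five y characters, then z, and nothing else.
    public static bool IsXyz(string input)
    {
        return Regex.IsMatch(input, @"^xy{3,5}z$");
    }

    // Two to four arbitrary characters followed by "ade", anywhere in the input.
    public static bool EndsLikeFacade(string input)
    {
        return Regex.IsMatch(input, @".{2,4}ade");
    }

    // Any one of the three alternatives, anywhere in the input.
    public static bool MentionsMystery(string input)
    {
        return Regex.IsMatch(input, @"mystery|myth|arcane");
    }

    public static void Main()
    {
        Console.WriteLine(IsXyz("xyyyz"));
        Console.WriteLine(IsXyz("xyyz"));
        Console.WriteLine(EndsLikeFacade("facade"));
        Console.WriteLine(EndsLikeFacade("fade"));
        Console.WriteLine(MentionsMystery("an arcane myth"));
    }
}
```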
The brackets metacharacter matches one occurrence of any character inside the
brackets. For example, \s[bgh]ut\s will match but, gut, and hut when they are
surrounded by white space, but not tut, xut, or zut. The expression
5,[89]-AnyString will match 5,8-AnyString and 5,9-AnyString, but not
5,88-AnyString, 5,89-AnyString, or 5,-AnyString.
A range of characters within the brackets can be indicated with a hyphen, or
dash. For example, x[j-m]z will match only xjz, xkz, xlz, and xmz. The
expression AnyFile0[7-9] will match only AnyFile07, AnyFile08, and AnyFile09.
If you want to include a dash within brackets as one of the characters to match,
simply put it before the right bracket. For example, x[1234-]z and x[1-4-]z will
match the same strings: x1z, x2z, x3z, x4z, and x-z, but nothing else.
The bracket metacharacter can also be reversed by placing a caret metacharacter
after the left bracket, letting you specify a range or list to exclude. For
example, AnyFile0[^02468] will match any nine-character string
that starts with AnyFile0 and ends with anything except an even digit. You can
combine inversion and ranges as well. For example, \W[^f-h]ood\W will match any
four-letter word ending in ood except for food, good, or hood.
Within brackets, most metacharacters (such as the period, star, plus, and
question mark) lose their special meaning and are treated as literal characters.
The characters that usually need quoting are the right bracket, the backslash,
a caret in first position, and a hyphen between two characters; the left
bracket can be quoted as well. For example, [\[\\\]]xyz will match any
four-character string that ends with xyz and starts with [, ], or \.
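The bracket examples above can be exercised with the sketch below. The class
name, the MatchesWhole helper, and its ^(?:...)$ wrapper are assumptions made
for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketDemo
{
    // True only if the pattern matches the whole input, not just part of it.
    public static bool MatchesWhole(string pattern, string input)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        // Each row pairs a pattern with one input to try it on.
        string[,] cases =
        {
            { @"x[j-m]z",          "xkz" },
            { @"x[j-m]z",          "xnz" },
            { @"x[1-4-]z",         "x-z" },
            { @"AnyFile0[^02468]", "AnyFile07" },
            { @"AnyFile0[^02468]", "AnyFile08" },
            { @"[\[\\\]]xyz",      "[xyz" },
            { @"[\[\\\]]xyz",      "axyz" }
        };

        for (int i = 0; i < cases.GetLength(0); i++)
        {
            Console.WriteLine("{0} on {1}: {2}",
                cases[i, 0], cases[i, 1], MatchesWhole(cases[i, 0], cases[i, 1]));
        }
    }
}
```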
Perhaps the most powerful element of regular expression syntax is the
backreference, where the result of a subpattern is loaded into a buffer for
reuse later in the expression. Parentheses identify backreference patterns, and
the buffers are numbered as each opening parenthesis is encountered from left to
right in the expression. Buffer numbers begin at 1 and continue up to the
maximum number of subexpressions allowed by the .NET Framework:
- If you search [abc]([def]) in be, the first backreference match will be e.
- If you search ([abc])([def]) in be, the first backreference match will be b
and the second backreference match will be e.
- If you search (ab(cd))ef in abcdef, the first backreference match will be abcd
and the second backreference match will be cd.
- If you search (a)+b* in aaaabbb, the first backreference match will be a.
- If you search (a+)b* in aaaabbb, the first backreference match will be aaaa.
- If you search ([abc])+ in aaabbbc, the first backreference match will be c.
You can access each buffer by using the form \n, where n is one or two decimal
digits identifying a specific buffer: \1 identifies the first buffer. For
example, the regular expression (\d)\1 could match 44, 55, or 99, but wouldn't
match 24 or 83.
One of the simplest, most useful applications of backreferences is to locate the
occurrence of two identical words together, for example in Were you drunk or
sober last night night night? The expression \b([a-z]+) \1\b will match night
night.
A group must be complete before it can be referenced: its closing parenthesis
has to appear before the backreference. The expression (\w(\1)) contains an
invalid backreference since the first set of parentheses is not complete where
the backreference appears.
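In .NET code, the numbered buffers are available through the Groups property of
a Match. The following sketch reads them for some of the searches listed above
and runs the repeated-word search; the class and method names are made up for
illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Returns the text held in the numbered buffer, or null if there is no match.
    public static string Capture(string pattern, string input, int group)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[group].Value : null;
    }

    // Returns the first immediately repeated lowercase word, or null if none.
    public static string FirstRepeatedWord(string input)
    {
        Match m = Regex.Match(input, @"\b([a-z]+) \1\b");
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Capture("(ab(cd))ef", "abcdef", 1));
        Console.WriteLine(Capture("(ab(cd))ef", "abcdef", 2));
        Console.WriteLine(Capture("(a)+b*", "aaaabbb", 1));
        Console.WriteLine(Capture("(a+)b*", "aaaabbb", 1));
        Console.WriteLine(FirstRepeatedWord(
            "Were you drunk or sober last night night night?"));
    }
}
```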
Here is a more advanced example where we validate a URI (uniform resource
identifier), such as http://www.mindcracker.com:8080/myfolder/index.html#content1.
The regular expression (\w+):\/\/([^/:]+)(:\d*)?([^# ]*) does the following:
- (\w+):\/\/ captures any word that precedes a colon and two forward slashes.
- ([^/:]+) captures the domain address part: any sequence of one or more
characters that does not include a forward slash or colon.
- (:\d*)? captures a Web site port number, if it is specified: a colon followed
by zero or more digits.
- ([^# ]*) captures the subdirectory and the page address specified by the Web
URI: zero or more characters other than # or the space character.
The first backreference will be http, the
second backreference will be www.mindcracker.com, the third backreference will
be :8080, and the fourth backreference will be /myfolder/index.html.
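The four buffers of the URI expression can be read in code roughly as follows.
This is a sketch only; the class name UriPartsDemo and the Parse method are
hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UriPartsDemo
{
    // Returns the scheme, host, port, and path buffers, or null if there is no match.
    public static string[] Parse(string uri)
    {
        Match m = Regex.Match(uri, @"(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)");
        if (!m.Success)
        {
            return null;
        }
        return new[]
        {
            m.Groups[1].Value,
            m.Groups[2].Value,
            m.Groups[3].Value,
            m.Groups[4].Value
        };
    }

    public static void Main()
    {
        string[] parts = Parse(
            "http://www.mindcracker.com:8080/myfolder/index.html#content1");
        foreach (string part in parts)
        {
            Console.WriteLine(part);
        }
    }
}
```

If the port is missing from the URI, the third group does not take part in the
match and its Value is an empty string.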
Backreferences allow for strings of data that change slightly from instance to
instance, such as page numbering schemes. We may have a document that numbers
each page with the notation <page n="[some number]" id="[some chapter name]">;
the number and the chapter name change from page to page,
but the rest of the string stays the same. We can write a simple search-and-replace
expression that matches these subpatterns:
<page n="\([0-9]+\)" id="\([A-Za-z]+\)">/Page \1, Chapter \2
This expression uses the notation of Unix tools such as sed, where capturing
parentheses are written \( and \) and a slash separates the search pattern from
the replacement text. Buffer number one (\1) holds the first matched sequence,
([0-9]+); buffer number two (\2) holds the second, ([A-Za-z]+). In a .NET
pattern the parentheses are written without backslashes, and a replacement
string refers to the buffers as $1 and $2.
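In .NET, the same page-numbering substitution might be written with
Regex.Replace, using unescaped parentheses in the pattern and $1 and $2 in the
replacement string. The class name, method name, and sample input below are
assumptions made for the sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PageTagDemo
{
    // Rewrites <page n="12" id="Strings"> as "Page 12, Chapter Strings".
    public static string Describe(string input)
    {
        // In a verbatim string, a double quote is written as two double quotes.
        string pattern = @"<page n=""([0-9]+)"" id=""([A-Za-z]+)"">";
        return Regex.Replace(input, pattern, "Page $1, Chapter $2");
    }

    public static void Main()
    {
        Console.WriteLine(Describe("<page n=\"12\" id=\"Strings\">"));
    }
}
```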
Listing 20.42 shows the code for validating strings entered against various
regular expression patterns.
Listing 20.42: Regular Expressions
// Regular expressions
using System;
using System.Text.RegularExpressions;

class Validation
{
    public static void Main()
    {
        String strToTest;
        Validation objValidate = new Validation();

        Console.Write("Enter a String to Test for Natural Numbers:");
        strToTest = Console.ReadLine();
        if (objValidate.IsNaturalNumber(strToTest))
        {
            Console.WriteLine("{0} is a Valid Natural Number", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Natural Number", strToTest);
        }

        Console.Write("Enter a String to Test for Whole Numbers:");
        strToTest = Console.ReadLine();
        if (objValidate.IsWholeNumber(strToTest))
        {
            Console.WriteLine("{0} is a Valid Whole Number", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Whole Number", strToTest);
        }

        Console.Write("Enter a String to Test for Integers:");
        strToTest = Console.ReadLine();
        if (objValidate.IsInteger(strToTest))
        {
            Console.WriteLine("{0} is a Valid Integer", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Integer", strToTest);
        }

        Console.Write("Enter a String to Test for Positive Numbers:");
        strToTest = Console.ReadLine();
        if (objValidate.IsPositiveNumber(strToTest))
        {
            Console.WriteLine("{0} is a Valid Positive Number", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Positive Number", strToTest);
        }

        Console.Write("Enter a String to Test for Numbers:");
        strToTest = Console.ReadLine();
        if (objValidate.IsNumber(strToTest))
        {
            Console.WriteLine("{0} is a Valid Number", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Number", strToTest);
        }

        Console.Write("Enter a String to Test for Alpha Numerics:");
        strToTest = Console.ReadLine();
        if (objValidate.IsAlphaNumeric(strToTest))
        {
            Console.WriteLine("{0} is a Valid Alpha Numeric", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Alpha Numeric", strToTest);
        }

        Console.Write("Enter a String to Test for Alphabets:");
        strToTest = Console.ReadLine();
        if (objValidate.IsAlpha(strToTest))
        {
            Console.WriteLine("{0} is a Valid Alpha String", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Alpha String", strToTest);
        }
    }
    // Function to test for Positive Integers:
    // no non-digit characters, and at least one digit from 1 to 9
    public bool IsNaturalNumber(String strNumber)
    {
        Regex objNotNaturalPattern = new Regex("[^0-9]");
        Regex objNaturalPattern = new Regex("0*[1-9][0-9]*");
        return !objNotNaturalPattern.IsMatch(strNumber) &&
            objNaturalPattern.IsMatch(strNumber);
    }

    // Function to test for Positive Integers with zero inclusive.
    // Note: an empty string contains no non-digit character, so it also passes.
    public bool IsWholeNumber(String strNumber)
    {
        Regex objNotWholePattern = new Regex("[^0-9]");
        return !objNotWholePattern.IsMatch(strNumber);
    }

    // Function to Test for Integers both Positive & Negative:
    // only digits and minus signs, with at most one leading minus
    public bool IsInteger(String strNumber)
    {
        Regex objNotIntPattern = new Regex("[^0-9-]");
        Regex objIntPattern = new Regex("^-[0-9]+$|^[0-9]+$");
        return !objNotIntPattern.IsMatch(strNumber) &&
            objIntPattern.IsMatch(strNumber);
    }
    // Function to Test for Positive Number both Integer & Real:
    // only digits and periods, ending in a digit, with at most one period
    public bool IsPositiveNumber(String strNumber)
    {
        Regex objNotPositivePattern = new Regex("[^0-9.]");
        Regex objPositivePattern =
            new Regex("^[.][0-9]+$|[0-9]*[.]*[0-9]+$");
        Regex objTwoDotPattern = new Regex("[0-9]*[.][0-9]*[.][0-9]*");
        return !objNotPositivePattern.IsMatch(strNumber) &&
            objPositivePattern.IsMatch(strNumber) &&
            !objTwoDotPattern.IsMatch(strNumber);
    }

    // Function to test whether the string is valid number or not
    public bool IsNumber(String strNumber)
    {
        Regex objNotNumberPattern = new Regex("[^0-9.-]");
        Regex objTwoDotPattern = new Regex("[0-9]*[.][0-9]*[.][0-9]*");
        Regex objTwoMinusPattern = new Regex("[0-9]*[-][0-9]*[-][0-9]*");
        String strValidRealPattern =
            "^([-]|[.]|[-.]|[0-9])[0-9]*[.]*[0-9]+$";
        // Requires at least one digit, so a lone "-" is not accepted as a number.
        String strValidIntegerPattern = "^[-]?[0-9]+$";
        Regex objNumberPattern = new Regex(
            "(" + strValidRealPattern + ")|(" + strValidIntegerPattern + ")");
        return !objNotNumberPattern.IsMatch(strNumber) &&
            !objTwoDotPattern.IsMatch(strNumber) &&
            !objTwoMinusPattern.IsMatch(strNumber) &&
            objNumberPattern.IsMatch(strNumber);
    }
    // Function To test for Alphabets.
    public bool IsAlpha(String strToCheck)
    {
        Regex objAlphaPattern = new Regex("[^a-zA-Z]");
        return !objAlphaPattern.IsMatch(strToCheck);
    }

    // Function to Check for AlphaNumeric.
    public bool IsAlphaNumeric(String strToCheck)
    {
        Regex objAlphaNumericPattern = new Regex("[^a-zA-Z0-9]");
        return !objAlphaNumericPattern.IsMatch(strToCheck);
    }
}
Split and Match Methods
The Regex class has a few significant methods:
The Regex.Split method splits an input string into an array of substrings at the
positions defined by a regular expression match.
The Regex.Replace method replaces all occurrences of a character pattern defined
by a regular expression with a specified replacement character string.
The Regex.Matches method searches an input string for all occurrences of a
regular expression and returns all the successful matches, as if Match were
called repeatedly.
There is also a MatchCollection class that represents the set of successful
matches found by iteratively applying a regular expression pattern to the input
string.
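Before the full listing, here is a compact sketch of the three methods using
the static overloads on the Regex class. The class name, method names,
patterns, and sample inputs are chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexMethodsDemo
{
    // Splits on commas, discarding any white space around each comma.
    public static string[] SplitOnCommas(string input)
    {
        return Regex.Split(input, @"\s*,\s*");
    }

    // Replaces every digit with a '#' character.
    public static string MaskDigits(string input)
    {
        return Regex.Replace(input, @"\d", "#");
    }

    // Collects every run of digits from a MatchCollection into an array.
    public static string[] FindNumbers(string input)
    {
        MatchCollection matches = Regex.Matches(input, @"\d+");
        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", SplitOnCommas("a, b ,c")));
        Console.WriteLine(MaskDigits("tel 555"));
        Console.WriteLine(string.Join("|", FindNumbers("10 apples, 20 pears")));
    }
}
```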
Listing 20.43 illustrates the Split and Matches methods and the MatchCollection
class.
Listing 20.43: Split and Match Examples
using System;
using System.Text.RegularExpressions;

public class RegExpSplit
{
    public static void Main(string[] args)
    {
        // Pattern used when the user just presses Enter at a pattern prompt.
        const string defaultPattern = "[0-9 a-z A-Z]*";

        Console.WriteLine(@"Enter a split delimiter ( default is [0-9 a-z A-Z]* ) : ");
        string input = Console.ReadLine();
        metaExp = string.IsNullOrEmpty(input) ? defaultPattern : input;
        Console.WriteLine(@"Enter a meta string: ");
        string[] rets = ParseExtnSplit(Console.ReadLine());
        if (rets == null)
        {
            Console.WriteLine("Sorry no match");
        }
        else
        {
            Console.WriteLine(rets.Length);
            foreach (string x in rets)
                Console.WriteLine(x);
        }

        Console.WriteLine(@"Enter a match pattern ( default is [0-9 a-z A-Z]* ) : ");
        input = Console.ReadLine();
        metaExp = string.IsNullOrEmpty(input) ? defaultPattern : input;
        Console.WriteLine(@"Enter a meta string: ");
        rets = ParseExtnMatch(Console.ReadLine());
        if (rets == null)
        {
            Console.WriteLine("Sorry no match");
        }
        else
        {
            Console.WriteLine(rets.Length);
            foreach (string x in rets)
                Console.WriteLine(x);
        }
    }
    // Splits the input at every match of the current pattern.
    public static string[] ParseExtnSplit(String ext)
    {
        Regex rx = new Regex(metaExp);
        return rx.Split(ext);
    }

    // Returns every match of the current pattern, or null if there is none.
    public static string[] ParseExtnMatch(String ext)
    {
        // case insensitive match
        Regex rx = new Regex(metaExp, RegexOptions.IgnoreCase);
        MatchCollection rez = rx.Matches(ext);
        string[] ret = null;
        if (rez.Count > 0)
        {
            ret = new string[rez.Count];
            for (int i = 0; i < rez.Count; i++)
            {
                ret[i] = rez[i].ToString();
            }
        }
        return ret;
    }

    private static string metaExp = "[0-9 a-z A-Z]*";
}
Conclusion
We hope this article has helped you understand regular expressions
in C#. See other articles on the website on .NET and C#.
The Complete Visual C# Programmer's Guide covers most of the major components
that make up C# and the .NET environment. The book is geared toward the
intermediate programmer, but contains enough material to satisfy the
advanced developer.