Regular Expressions in C#

Puran Mehra
14y

This article has been excerpted from the book "The Complete Visual C# Programmer's Guide" by the authors of C# Corner.

A regular expression is a string of characters that contains a pattern to find the string or strings you are looking for. In its simplest form, a regular expression is just a word or phrase to search for in the source string. Regular expressions include metacharacters which are special characters that add a great deal of flexibility and convenience to the search.

Regular expressions have their origins in automata theory and formal language theory, which study models of computation (automata) and ways to describe and classify formal languages. In theoretical computer science, a formal language is nothing but a set of strings.

In the 1940s, two mathematicians, Warren McCulloch and Walter Pitts, described the nervous system by modeling neurons. Later, mathematician Stephen Kleene described these models using his mathematical notation called regular sets and developed regular expressions as a notation for describing them.

Afterward, Ken Thompson, one of the key creators of the Unix operating system, built regular expressions into Unix-based text tools like qed (the predecessor of the Unix ed) and grep. Since then, regular expressions have been widely used in Windows and Unix.

Patterns

Let's examine two regular expression patterns:

Pattern #1: Regex objNotNaturalPattern = new Regex("[^0-9]");

Pattern #2: Regex objNaturalPattern = new Regex("0*[1-9][0-9]*");

Pattern #1 matches any string that contains at least one character that is not a digit from 0 to 9. Inside brackets, a leading ^ means "not", and a hyphen gives a range of values, such as 0-9, a-z, or A-Z. For example, the string abc returns true when you apply the regular expression for pattern #1, but the string 123 returns false for the same pattern.

Pattern #2 describes a natural number (a whole number greater than 0). The subpattern 0* indicates that a natural number can be prefixed with any number of zeroes or no zeroes at all. The next subpattern, [1-9], means that it must contain at least one digit between 1 and 9 (including 1 and 9). The last subpattern, [0-9]*, indicates that it can end with any number of digits between 0 and 9. For example, 0007 returns true, whereas 00 returns false. Because the pattern is not anchored, it also succeeds on a string such as abc7 that merely contains a natural number; to accept strings that contain only a natural number, combine it with pattern #1, as the IsNaturalNumber method in Listing 20.42 does.
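The two patterns can be tried out with a short program. This is a minimal sketch: the class and method names (PatternDemo, HasNonDigit, HasNaturalNumber) are illustrative and do not come from the book, and it uses the static Regex.IsMatch method rather than constructing Regex objects.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Pattern #1: true when the input contains at least one non-digit character.
    public static bool HasNonDigit(string input)
    {
        return Regex.IsMatch(input, "[^0-9]");
    }

    // Pattern #2: true when the input contains a nonzero digit somewhere
    // (the pattern is unanchored, so it searches inside the string).
    public static bool HasNaturalNumber(string input)
    {
        return Regex.IsMatch(input, "0*[1-9][0-9]*");
    }

    public static void Main()
    {
        foreach (string s in new[] { "abc", "123", "0007", "00", "abc7" })
        {
            Console.WriteLine("{0}: non-digit={1}, natural={2}",
                s, HasNonDigit(s), HasNaturalNumber(s));
        }
    }
}
```

Only a string for which HasNonDigit is false and HasNaturalNumber is true consists of nothing but a natural number.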

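Here are the basic pattern metacharacters used by Regex:

* = zero or more
? = zero or one
^ = not (as the first character inside brackets); outside brackets, the start of the string
[] = set or range of characters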

Metacharacters

The period is a widely used wildcard metacharacter. It matches exactly one character and does not care what the character is. For example, the regular expression 5,.-AnyString will match 5,8-AnyString and 5,9-AnyString. The period will not match a string of characters or the null string. Thus, 5,800-AnyString and 5,-AnyString will not be matched by the regular expression above.

What if you want to search for a string containing a period? For example, we may wish to search for references to the mathematical constant pi. Bear in mind that the regular expression 3.14 would match 3.14, 3914, 3g14, and even 3*14.

We can get around this with a second metacharacter, the backslash, which indicates that the character following it must be taken as a literal character. If we want to search for the string 3.14, we would use 3\.14. In regular expression terminology, this operation is called quoting, and the period in the regular expression is said to be quoted.
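The difference between the quoted and unquoted period can be checked directly. The sketch below is illustrative only: the QuotingDemo class and its method names are assumptions, and the ^ and $ anchors are added so that each test applies to the whole input string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotingDemo
{
    // The unquoted period accepts any single character between 3 and 14.
    public static bool MatchesUnquoted(string input)
    {
        return Regex.IsMatch(input, @"^3.14$");
    }

    // The quoted period accepts only a literal period.
    public static bool MatchesQuoted(string input)
    {
        return Regex.IsMatch(input, @"^3\.14$");
    }

    public static void Main()
    {
        foreach (string s in new[] { "3.14", "3914", "3g14", "3*14" })
        {
            Console.WriteLine("{0}: unquoted={1}, quoted={2}",
                s, MatchesUnquoted(s), MatchesQuoted(s));
        }
    }
}
```

The verbatim string literal (the @ prefix) keeps the C# compiler from interpreting the backslash itself, so the regular expression engine receives 3\.14 unchanged.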

Be careful when using the backslash to quote, since it has another function in escape sequences such as \n, \r, \t, \f, \b, \B, \0, \1, \2, \3, \4, \5, \6, \7, \8, and \9. These sequences have special meanings of their own (for example, \n is a new line, \b is a word boundary, and \1 is a backreference), so they cannot be used to search for the literal letter or digit. Quoting a metacharacter (e.g., \.) turns it into a normal character, but quoting a normal character (e.g., \9) may turn it into a metacharacter.

The question mark indicates that the character immediately preceding it appears either zero times or one time. For example, A?nyString would match either nyString or AnyString; Another?String would match either AnotheString or AnotherString.

The star, or asterisk, indicates that the character to its left can be repeated zero or any number of times. For example, XY*Z would match XZ, XYZ, XYYZ, XYYYZ, or XYYYYYYYYZ. In other words, any string is satisfactory if it starts with an X, is followed by a sequence of any number of Y characters, and ends with a Z.

The plus metacharacter is just like the star metacharacter except that it doesn't match the null string. For example, XY+Z would not match XZ but would match XYZ, XYYZ, XYYYZ, or XYYYYYYYYZ.
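The three repetition metacharacters can be compared side by side. This is a small sketch with an assumed helper, QuantifierDemo.Matches, which anchors the pattern with ^ and $ so that the whole string must match, as the examples in the text imply.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // True when the entire input matches the pattern.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, "^" + pattern + "$");
    }

    public static void Main()
    {
        // ? : zero or one occurrence of the preceding character
        Console.WriteLine(Matches("A?nyString", "nyString"));
        Console.WriteLine(Matches("A?nyString", "AnyString"));

        // * : zero or more occurrences
        Console.WriteLine(Matches("XY*Z", "XZ"));
        Console.WriteLine(Matches("XY*Z", "XYYYZ"));

        // + : one or more occurrences, so XZ is rejected
        Console.WriteLine(Matches("XY+Z", "XZ"));
        Console.WriteLine(Matches("XY+Z", "XYYYZ"));
    }
}
```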

Many metacharacters can be combined. A practical combination is the period followed by the star, which matches a string of any length, even the null string. For example, AnyString.*ade would match AnyStringDecade, AnyStringFacade, AnyString of steel made, and even AnyStringade. It matches any string starting with AnyString, followed by any string or the null string, and ending with ade.

If you want to search for AnyStringDecade and AnyStringFacade but do not want to match AnyString of steel made, you could string together three periods: AnyString...ade. Only strings 15 characters long which start with AnyString and end with ade will be matched.

Now, with x\.*z you will match any string that starts with x, is followed by a series of periods or no period, and ends with z: for example, xz, x.z, x..z, or x...z.

The expression x.\*z will match any string that starts with x, is followed by one arbitrary character, and ends with *z: xf*z, x9*z, x@*z. The expression x\++z will match any string that starts with x, is followed by one or a series of plus signs, and is terminated by z. Thus, xz is not matched, but x+z, x++z, and x+++z are.

The expression b.?t will match but, bat, bot, and any other three-character string that begins with b and ends with t, and will also match bt. The expression b\.?t will match only bt and b.t. The expression b.\?t will match any four-character string that starts with b and ends with ?t: bu?t, b9?t, b@?t. The expression b\.\?t will match only b.?t.

We mentioned that the backslash can turn ordinary characters into metacharacters and vice versa. One example is the digit metacharacter, \d, which will match exactly one digit. For example, 5,\d-AnyString will match 5,5-AnyString and 5,9-AnyString. Also, 5\.\d\d\d\d will match any five-digit floating-point number from 5.0000 to 5.9999.

We can combine the digit metacharacter with other metacharacters. For example, x\d+z will match any string that starts with x, is followed by a string of one or more digits, and ends with z. Note that since the plus sign is used, the expression will not match xz.

In the digit metacharacter, the letter d must be lowercase because the nondigit metacharacter, \D, uses the uppercase D. The nondigit metacharacter will match any character except a digit. For example, x\Dz will match xyz, xYz, or x@z, but not x0z, x1z, or x9z. Most metacharacters using a backslash take the inverse meaning with an uppercase letter.
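A brief sketch of the digit and nondigit metacharacters follows. The DigitDemo class and its method names are illustrative assumptions, and the patterns are anchored so the whole string is tested. Note that in .NET, \d also matches decimal digits from other Unicode scripts unless RegexOptions.ECMAScript is specified.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitDemo
{
    // x, then one or more digits, then z.
    public static bool IsXDigitsZ(string input)
    {
        return Regex.IsMatch(input, @"^x\d+z$");
    }

    // x, then exactly one character that is not a digit, then z.
    public static bool IsXNonDigitZ(string input)
    {
        return Regex.IsMatch(input, @"^x\Dz$");
    }

    public static void Main()
    {
        foreach (string s in new[] { "x123z", "xz", "xyz", "x@z", "x0z" })
        {
            Console.WriteLine("{0}: digits={1}, nondigit={2}",
                s, IsXDigitsZ(s), IsXNonDigitZ(s));
        }
    }
}
```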

The word metacharacter, \w, matches exactly one letter, one number, or the underscore character. Its inverse, \W, matches any one character except a letter, a number, or the underscore. For example, x\wz will match xyz, xYz, x9z, x_z, or any other three-character string that starts with x, has a second character that is either a letter, a number, or the underscore, and ends with z.

The white-space metacharacter, \s, matches exactly one character of white space: spaces, tabs, new lines, or any other character that cannot be seen when printed. Its opposite, \S, matches any character that is not white space. For example, x\sz will match any three-character string that starts with x, has a second character that is a space, tab, or new line, and ends with z. The expression x\Sz will match any three-character string that starts with x, has a second character that is not a space, tab, or new line, and ends with z.

The word-boundary metacharacter, \b, matches the position at the beginning or end of a word, that is, between a word character and a space, a punctuation mark, or the start or end of the string. Its opposite, \B, matches any position that is not a word boundary. For example, \bcommut will match commuter or commuting, but will not match telecommuter since there is no space or punctuation between tele and commuter. The expression \Bcommut will not match a word like commuter or commuting unless it is part of a larger word such as telecommuter or telecommuting. The underscore is considered a "word" character. For example, \bcommuter will not match tele_commuter, but will match tele commuter and tele-commuter.
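Word boundaries are easiest to understand by experiment. The following sketch uses an assumed helper, BoundaryDemo.Found, that reports whether a pattern is found anywhere in the input; the underscore cases use the pattern \bcommuter, which tests for a boundary immediately before commuter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // True when the pattern is found anywhere in the input.
    public static bool Found(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // \b requires commut to begin at the start of a word.
        Console.WriteLine(Found(@"\bcommut", "commuter"));
        Console.WriteLine(Found(@"\bcommut", "telecommuter"));

        // \B requires commut to begin inside a word.
        Console.WriteLine(Found(@"\Bcommut", "commuter"));
        Console.WriteLine(Found(@"\Bcommut", "telecommuter"));

        // The underscore is a word character, so it does not create a boundary.
        Console.WriteLine(Found(@"\bcommuter", "tele_commuter"));
        Console.WriteLine(Found(@"\bcommuter", "tele-commuter"));
    }
}
```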

The octal metacharacter, \nnn, where n is a number from zero to seven, is generally used to specify control characters that have no typed equivalent. For example, \007 will match an embedded ASCII bell character, the ASCII value of 7.

The braces metacharacter follows a normal character and contains two numbers separated by a comma and surrounded by braces. It acts like the star metacharacter, but the number of repetitions it matches must be within the minimum and maximum specified by the two numbers in braces. For example, xy{3,5}z will match only xyyyz, xyyyyz, and xyyyyyz. The expression .{2,4}ade will match cascade, facade, arcade, or decade, but not fade since f is only one character long.
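The bounded repetition can be verified with a short sketch. The BraceDemo class and its WholeMatch helper are assumptions made for illustration; the helper anchors each pattern so that the entire string must match, and the first pattern is written as xy{3,5}z.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BraceDemo
{
    // True when the entire input matches the pattern.
    public static bool WholeMatch(string pattern, string input)
    {
        return Regex.IsMatch(input, "^" + pattern + "$");
    }

    public static void Main()
    {
        // Between three and five y characters are accepted.
        foreach (string s in new[] { "xyyz", "xyyyz", "xyyyyyz", "xyyyyyyz" })
        {
            Console.WriteLine("{0}: {1}", s, WholeMatch("xy{3,5}z", s));
        }

        // Between two and four arbitrary characters before ade.
        foreach (string s in new[] { "cascade", "facade", "fade" })
        {
            Console.WriteLine("{0}: {1}", s, WholeMatch(".{2,4}ade", s));
        }
    }
}
```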

The vertical bar metacharacter indicates an either/or choice. For example, mystery|myth|arcane will match strings with either mystery or myth or arcane or any combination of all three.

The brackets metacharacter matches one occurrence of any character inside the brackets. For example, \s[bgh]ut\s will match but, gut, and hut, but not tut, xut, or zut. The expression 5,[89]-AnyString will match 5,8-AnyString and 5,9-AnyString, but not 5,88-AnyString, 5,89-AnyString, or 5,-AnyString.

A range of characters within the brackets can be indicated with a hyphen, or dash. For example, x[j-m]z will match only xjz, xkz, xlz, and xmz. The expression AnyFile0[7-9] will match only AnyFile07, AnyFile08, and AnyFile09.

If you want to include a dash within brackets as one of the characters to match, simply put it before the right bracket. For example, x[1234-]z and x[1-4-]z will match the same strings: x1z, x2z, x3z, x4z, and x-z, but nothing else.

The bracket metacharacter can also be reversed by placing a caret metacharacter after the left bracket, letting you specify a range or list to exclude. For example, AnyFile0[^02468] will match any nine-character string that starts with AnyFile0 and ends with anything except an even digit. You can combine inversion and ranges as well. For example, \W[^f-h]ood\W will match any four-letter word ending in ood except for food, good, or hood.
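Ranges, a literal dash, and an inverted set can each be tried with a few inputs. This is a sketch only: BracketDemo and WholeMatch are hypothetical names, and the helper anchors every pattern to the whole string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketDemo
{
    // True when the entire input matches the pattern.
    public static bool WholeMatch(string pattern, string input)
    {
        return Regex.IsMatch(input, "^" + pattern + "$");
    }

    public static void Main()
    {
        // A range: j, k, l, or m.
        Console.WriteLine(WholeMatch("x[j-m]z", "xkz"));
        Console.WriteLine(WholeMatch("x[j-m]z", "xnz"));

        // A dash placed just before the right bracket is a literal dash.
        Console.WriteLine(WholeMatch("x[1-4-]z", "x-z"));
        Console.WriteLine(WholeMatch("x[1-4-]z", "x5z"));

        // A caret after the left bracket excludes the listed characters.
        Console.WriteLine(WholeMatch("AnyFile0[^02468]", "AnyFile07"));
        Console.WriteLine(WholeMatch("AnyFile0[^02468]", "AnyFile08"));
    }
}
```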

Within brackets, ordinary quoting rules do not apply and most other metacharacters lose their special meaning; a period or a star inside brackets, for example, is just a literal character. The characters that typically need to be quoted are the left and right brackets and the backslash. For example, [\[\\\]]xyz will match any four-character string that ends with xyz and starts with [, ], or \.

Perhaps the most powerful element of regular expression syntax is the backreference, where results of a subpattern are loaded into a buffer for reuse later in the expression. Parentheses identify backreference patterns, and the buffers are numbered as each opening parenthesis is encountered from left to right in the expression. Buffer numbers begin at 1 and continue up to the maximum number of subexpressions allowed by the .NET Framework:

If you search [abc]([def]) in be, the first backreference match will be e.
If you search ([abc])([def]) in be, the first backreference match will be b and the second backreference match will be e.
If you search (ab(cd))ef in abcdef, the first backreference match will be abcd and the second backreference match will be cd.
If you search (a)+b* in aaaabbb, the first backreference match will be a.
If you search (a+)b* in aaaabbb, the first backreference match will be aaaa.
If you search ([abc])+ in aaabbbc, the first backreference match will be c.
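The buffer contents listed above can be inspected through the Groups collection of a Match object. In this sketch, GroupDemo and CaptureText are illustrative names; Groups[n].Value holds the last text captured by the nth pair of parentheses.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns the text held in buffer n after the first match of pattern in input.
    public static string CaptureText(string pattern, string input, int n)
    {
        return Regex.Match(input, pattern).Groups[n].Value;
    }

    public static void Main()
    {
        Console.WriteLine(CaptureText("[abc]([def])", "be", 1));
        Console.WriteLine(CaptureText("([abc])([def])", "be", 2));
        Console.WriteLine(CaptureText("(ab(cd))ef", "abcdef", 1));
        Console.WriteLine(CaptureText("(ab(cd))ef", "abcdef", 2));
        // A repeated group keeps only its last capture.
        Console.WriteLine(CaptureText("(a)+b*", "aaaabbb", 1));
        Console.WriteLine(CaptureText("(a+)b*", "aaaabbb", 1));
        Console.WriteLine(CaptureText("([abc])+", "aaabbbc", 1));
    }
}
```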

You can access each buffer by using the form \n, where n is one or two decimal digits identifying a specific buffer: \1 identifies the first buffer. For example, the regular expression (\d)\1 could match 44, 55, or 99, but wouldn't match 24 or 83.

One of the simplest, most useful applications of backreferences is to locate the occurrence of two identical words together, for example, in the sentence Were you drunk or sober last night night night? The expression \b([a-z]+) \1\b will match night night.
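The repeated-word search can be run as follows. This is a minimal sketch; the RepeatDemo class and FirstRepeat method are assumed names, and the method returns an empty string when no repeated word is found.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatDemo
{
    // Returns the first pair of identical lowercase words separated by one space.
    public static string FirstRepeat(string text)
    {
        Match m = Regex.Match(text, @"\b([a-z]+) \1\b");
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        Console.WriteLine(FirstRepeat("Were you drunk or sober last night night night?"));
        Console.WriteLine(FirstRepeat((string)"55 is matched by a digit and \\1".Clone()).Length);
    }
}
```

Because [a-z] covers only lowercase letters, a pair such as The the is not reported; passing RegexOptions.IgnoreCase to Regex.Match would include it.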

To be complete, a backreference expression must be enclosed in parentheses. The expression (\w(\1)) contains an invalid backreference since the first set of parentheses is not complete where the backreference appears.

Here is a more advanced example where we validate a URI (uniform resource identifier), such as http://www.mindcracker.com:8080/myfolder/index.html#content1. The regular expression (\w+):\/\/([^/:]+)(:\d*)?([^# ]*) does the following:

(\w+):\/\/ matches any word that precedes a colon and two forward slashes.
([^/:]+) captures the domain address part: any sequence of characters that does not include a forward slash or colon.
(:\d*) captures a Web site port number, if it is specified: zero or more digits following a colon.
([^# ]*) captures the subdirectory and the page address specified by the Web URI: zero or more characters other than # or the space character.

The first backreference will be http, the second backreference will be www.mindcracker.com, the third backreference will be :8080, and the fourth backreference will be /myfolder/index.html.
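The four buffers can be read back from the Groups collection. The sketch below assumes a hypothetical UriDemo class; its Parts method returns buffers one through four in order.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UriDemo
{
    // The URI pattern from the text, written as a verbatim string literal.
    public const string Pattern = @"(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)";

    // Returns the scheme, host, port (with its colon), and path.
    public static string[] Parts(string uri)
    {
        Match m = Regex.Match(uri, Pattern);
        return new[]
        {
            m.Groups[1].Value, m.Groups[2].Value,
            m.Groups[3].Value, m.Groups[4].Value
        };
    }

    public static void Main()
    {
        string uri = "http://www.mindcracker.com:8080/myfolder/index.html#content1";
        foreach (string part in Parts(uri))
        {
            Console.WriteLine(part);
        }
    }
}
```

When the URI has no port, the optional third group does not participate in the match and its Value is an empty string.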

Backreferences allow for strings of data that change slightly from instance to instance, such as page numbering schemes. We may have a document that numbers each page with the notation <page n="[some number]" id="[some chapter name]">; the number and the chapter name change from page to page, but the rest of the string stays the same. We can write a simple regular expression that matches these subpatterns:

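<page n="([0-9]+)" id="([A-Za-z]+)">/Page \1, Chapter \2

The part before the slash is the search pattern, and the part after it is the replacement text. Buffer number one (\1) holds the first matched sequence, ([0-9]+); buffer number two (\2) holds the second, ([A-Za-z]+). Note that in a .NET replacement string, such as the one passed to Regex.Replace, these buffers are written $1 and $2 rather than \1 and \2.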
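In .NET, this kind of substitution is done with Regex.Replace, where the buffers appear in the replacement string as $1 and $2. The following is a sketch under that assumption; the PageDemo class, the Rewrite method, and the sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PageDemo
{
    // Rewrites a page tag as plain text using the two captured buffers.
    public static string Rewrite(string input)
    {
        // Doubled quotes ("") stand for a single quote inside a verbatim string.
        string pattern = @"<page n=""([0-9]+)"" id=""([A-Za-z]+)"">";
        return Regex.Replace(input, pattern, "Page $1, Chapter $2");
    }

    public static void Main()
    {
        Console.WriteLine(Rewrite("<page n=\"12\" id=\"Strings\">"));
    }
}
```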

Listing 20.42 shows the code for validating strings entered against various regular expression patterns.

Listing 20.42: Regular Expressions

// Regular expressions
using System;
using System.Text.RegularExpressions;

class Validation
{
    public static void Main()
    {
        String strToTest;
        Validation objValidate = new Validation();

        Console.Write("Enter a String to Test for Natural Numbers:");
        strToTest = Console.ReadLine();
        if (objValidate.IsNaturalNumber(strToTest))
        {
            Console.WriteLine("{0} is a Valid Natural Number", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Natural Number", strToTest);
        }

        Console.Write("Enter a String to Test for Whole Numbers:");
        strToTest = Console.ReadLine();
        if (objValidate.IsWholeNumber(strToTest))
        {
            Console.WriteLine("{0} is a Valid Whole Number", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Whole Number", strToTest);
        }

        Console.Write("Enter a String to Test for Integers:");
        strToTest = Console.ReadLine();
        if (objValidate.IsInteger(strToTest))
        {
            Console.WriteLine("{0} is a Valid Integer", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Integer", strToTest);
        }

        Console.Write("Enter a String to Test for Positive Numbers:");
        strToTest = Console.ReadLine();
        if (objValidate.IsPositiveNumber(strToTest))
        {
            Console.WriteLine("{0} is a Valid Positive Number", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Positive Number", strToTest);
        }

        Console.Write("Enter a String to Test for Numbers:");
        strToTest = Console.ReadLine();
        if (objValidate.IsNumber(strToTest))
        {
            Console.WriteLine("{0} is a Valid Number", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Number", strToTest);
        }

        Console.Write("Enter a String to Test for Alpha Numerics:");
        strToTest = Console.ReadLine();
        if (objValidate.IsAlphaNumeric(strToTest))
        {
            Console.WriteLine("{0} is a Valid Alpha Numeric", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Alpha Numeric", strToTest);
        }

        Console.Write("Enter a String to Test for Alphabets:");
        strToTest = Console.ReadLine();
        if (objValidate.IsAlpha(strToTest))
        {
            Console.WriteLine("{0} is a Valid Alpha String", strToTest);
        }
        else
        {
            Console.WriteLine("{0} is not a Valid Alpha String", strToTest);
        }
    }

    // Function to test for Positive Integers:
    // no non-digit characters, and at least one digit from 1 to 9
    public bool IsNaturalNumber(String strNumber)
    {
        Regex objNotNaturalPattern = new Regex("[^0-9]");
        Regex objNaturalPattern = new Regex("0*[1-9][0-9]*");
        return !objNotNaturalPattern.IsMatch(strNumber) &&
            objNaturalPattern.IsMatch(strNumber);
    }

    // Function to test for Positive Integers with zero inclusive
    public bool IsWholeNumber(String strNumber)
    {
        Regex objNotWholePattern = new Regex("[^0-9]");
        // An empty string contains no invalid characters, so reject it explicitly
        return strNumber.Length > 0 && !objNotWholePattern.IsMatch(strNumber);
    }

    // Function to Test for Integers both Positive & Negative
    public bool IsInteger(String strNumber)
    {
        Regex objNotIntPattern = new Regex("[^0-9-]");
        Regex objIntPattern = new Regex("^-[0-9]+$|^[0-9]+$");
        return !objNotIntPattern.IsMatch(strNumber) &&
            objIntPattern.IsMatch(strNumber);
    }

    // Function to Test for Positive Number both Integer & Real
    public bool IsPositiveNumber(String strNumber)
    {
        Regex objNotPositivePattern = new Regex("[^0-9.]");
        Regex objPositivePattern = new Regex(
            "^[.][0-9]+$|[0-9]*[.]*[0-9]+$");
        Regex objTwoDotPattern = new Regex("[0-9]*[.][0-9]*[.][0-9]*");
        return !objNotPositivePattern.IsMatch(strNumber) &&
            objPositivePattern.IsMatch(strNumber) &&
            !objTwoDotPattern.IsMatch(strNumber);
    }

    // Function to test whether the string is valid number or not
    public bool IsNumber(String strNumber)
    {
        Regex objNotNumberPattern = new Regex("[^0-9.-]");
        Regex objTwoDotPattern = new Regex("[0-9]*[.][0-9]*[.][0-9]*");
        Regex objTwoMinusPattern = new Regex("[0-9]*[-][0-9]*[-][0-9]*");
        String strValidRealPattern =
            "^([-]|[.]|[-.]|[0-9])[0-9]*[.]*[0-9]+$";
        // Optional minus sign followed by at least one digit,
        // so that a lone "-" is not accepted as a number
        String strValidIntegerPattern = "^[-]?[0-9]+$";
        Regex objNumberPattern = new Regex("(" + strValidRealPattern
            + ")|(" + strValidIntegerPattern + ")");
        return !objNotNumberPattern.IsMatch(strNumber) &&
            !objTwoDotPattern.IsMatch(strNumber) &&
            !objTwoMinusPattern.IsMatch(strNumber) &&
            objNumberPattern.IsMatch(strNumber);
    }

    // Function To test for Alphabets.
    public bool IsAlpha(String strToCheck)
    {
        Regex objAlphaPattern = new Regex("[^a-zA-Z]");
        // An empty string contains no invalid characters, so reject it explicitly
        return strToCheck.Length > 0 && !objAlphaPattern.IsMatch(strToCheck);
    }

    // Function to Check for AlphaNumeric.
    public bool IsAlphaNumeric(String strToCheck)
    {
        Regex objAlphaNumericPattern = new Regex("[^a-zA-Z0-9]");
        // An empty string contains no invalid characters, so reject it explicitly
        return strToCheck.Length > 0 && !objAlphaNumericPattern.IsMatch(strToCheck);
    }
}

Split and Match Methods

There are a few significant Regex methods:

The Regex.Split method splits an input string into an array of substrings at the positions defined by a regular expression match.

The Regex.Replace method replaces all occurrences of a character pattern defined by a regular expression with a specified replacement character string.

The Regex.Matches method searches an input string for all occurrences of a regular expression and returns all the successful matches as if Match were called numerous times.

There is also a MatchCollection class that represents the set of successful matches found by iteratively applying a regular expression pattern to the input string.
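Before the interactive listing, here is a compact sketch of the three methods on fixed inputs. The MethodsDemo class, its method names, and the sample strings and patterns are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MethodsDemo
{
    // Splits at each comma or semicolon, together with any white space after it.
    public static string[] SplitList(string input)
    {
        return Regex.Split(input, @"[,;]\s*");
    }

    // Replaces every run of digits with a single # character.
    public static string MaskDigits(string input)
    {
        return Regex.Replace(input, @"\d+", "#");
    }

    // Counts the runs of digits found in the input.
    public static int CountNumbers(string input)
    {
        MatchCollection matches = Regex.Matches(input, @"\d+");
        return matches.Count;
    }

    public static void Main()
    {
        foreach (string item in SplitList("one, two;three"))
        {
            Console.WriteLine(item);
        }
        Console.WriteLine(MaskDigits("a1b22c333"));
        Console.WriteLine(CountNumbers("a1b22c333"));
    }
}
```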

Listing 20.43 illustrates the Split and Matches methods and the MatchCollection class.

Listing 20.43: Split and Match Examples

using System;
using System.Text.RegularExpressions;

public class RegExpSplit
{
    // Pattern used when the user enters nothing
    private const string DefaultExp = "[0-9 a-z A-Z]*";

    // Pattern currently used by the split and match helpers
    private static string metaExp = DefaultExp;

    public static void Main(string[] args)
    {
        Console.WriteLine(@"Enter a split delimiter ( default is [0-9 a-z A-Z]* ) : ");
        metaExp = ReadPattern();
        Console.WriteLine(@"Enter a meta string: ");
        string[] rets = ParseExtnSplit(Console.ReadLine());
        if (rets == null)
        {
            Console.WriteLine("Sorry no match");
        }
        else
        {
            Console.WriteLine(rets.Length);
            foreach (string x in rets)
                Console.WriteLine(x);
        }

        Console.WriteLine(@"Enter a match pattern ( default is [0-9 a-z A-Z]* ) : ");
        metaExp = ReadPattern();
        Console.WriteLine(@"Enter a meta string: ");
        rets = ParseExtnMatch(Console.ReadLine());
        if (rets == null)
        {
            Console.WriteLine("Sorry no match");
        }
        else
        {
            Console.WriteLine(rets.Length);
            foreach (string x in rets)
                Console.WriteLine(x);
        }
    }

    // Reads a pattern from the console, falling back to the default when the line is empty
    private static string ReadPattern()
    {
        string input = Console.ReadLine();
        return String.IsNullOrEmpty(input) ? DefaultExp : input;
    }

    // Splits the input at every match of the current pattern
    public static string[] ParseExtnSplit(String ext)
    {
        Regex rx = new Regex(metaExp);
        return rx.Split(ext);
    }

    // Returns every match of the current pattern, or null when there is none
    public static string[] ParseExtnMatch(String ext)
    {
        // case insensitive match
        Regex rx = new Regex(metaExp, RegexOptions.IgnoreCase);
        MatchCollection rez = rx.Matches(ext);
        string[] ret = null;
        if (rez.Count > 0)
        {
            ret = new string[rez.Count];
            for (int i = 0; i < rez.Count; i++)
            {
                ret[i] = rez[i].ToString();
            }
        }
        return ret;
    }
}

Conclusion

I hope this article has helped you understand regular expressions in C#. See the other articles on the website about .NET and C#.

The Complete Visual C# Programmer's Guide covers most of the major components that make up C# and the .NET environment. The book is geared toward the intermediate programmer, but contains enough material to satisfy the advanced developer.
