Regular Expressions
Introduction
Regular expressions are one way of searching for substrings in strings. The search is carried out by scanning a string for some pattern. A well-known example is the pair of wildcard symbols "*" and "?" used on the DOS command line. The first stands for zero or more arbitrary characters, the second for exactly one arbitrary character. So the search pattern "text?.*" will find the files textf.txt, text1.asp and others like them, but will not find text.txt or text.htm.
Regular expressions are usually used for three actions:
- Checking whether a substring matches a pattern.
- Finding the substrings that match a pattern and returning them to the user.
- Replacing the substrings that match a pattern.
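The three actions can be sketched with the .NET Regex class that the rest of this article describes. This is only a minimal illustration; the module name, the input string and the pattern are made up for the example.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ThreeActionsSketch
    Sub Main()
        ' Hypothetical sample data, chosen only for illustration.
        Dim input As String = "cat bat rat"
        Dim pattern As String = "[cb]at"

        ' 1. Check whether any substring matches the pattern.
        Console.WriteLine(Regex.IsMatch(input, pattern))          ' True

        ' 2. Find every matching substring and hand it to the caller.
        For Each m As Match In Regex.Matches(input, pattern)
            Console.WriteLine(m.Value)                            ' cat, then bat
        Next

        ' 3. Replace every matching substring.
        Console.WriteLine(Regex.Replace(input, pattern, "dog"))   ' dog dog rat
    End Sub
End Module
```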
In practice, three types of regular expression engines are used.
- DFA (Deterministic Finite-state Automaton) engines run in linear time because they do not need backtracking (and never test the same character twice). They are guaranteed to find the longest possible match. However, because a DFA contains only finite state, it cannot match a pattern with backreferences, and because it has no constructs with explicit expansion, it does not capture subexpressions. DFA engines are used, for example, in awk, egrep and lex.
- Traditional NFA (Nondeterministic Finite-state Automaton) engines use a "greedy" backtracking algorithm, testing all possible expansions of the regular expression in a fixed order and accepting the first match. Because a traditional NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpressions and match backreferences. But because of backtracking, a traditional NFA can visit the same place several times and therefore runs more slowly. Because a traditional NFA accepts the first match it finds, it may also fail to find the longest of the possible matches. Engines of this kind are used in Perl, Python, Emacs, Tcl and .NET.
- POSIX NFA engines are similar to traditional NFA engines, except for their "patience": they keep backtracking until they are sure they have found the longest possible match. As a result, a POSIX NFA engine is slower than a traditional one, and it cannot be made to prefer a shorter match over a longer one. One of the main advantages of the POSIX NFA engine is that it has a standard implementation.
Programmers most often use traditional NFA engines because they are more expressive than DFA or POSIX NFA engines. Although their running time grows exponentially in the worst case, writing patterns that reduce ambiguity and limit the depth of backtracking lets you control their behavior and bring the search time down to acceptable values.
.NET Regular Expressions
The regular expression classes make it possible to use regular expressions in .NET. All of them are contained in the System.Text.RegularExpressions namespace. In the .NET Framework this namespace is implemented in the System.dll assembly, which must be referenced at compile time in order to build your application. For example: vbc /r:System.dll foo.vb will build the foo.exe assembly with a reference to System.dll.
There are only six classes and one delegate that need an overview:
- Capture: Contains the results of a single match
- CaptureCollection: A sequence of Capture objects
- Group: The result of a single group capture, inherits from Capture
- Match: The result of a single expression match, inherits from Group
- MatchCollection: A sequence of Match objects
- MatchEvaluator: A delegate for use during replacement operations
- Regex: An instance of a compiled regular expression
The Regex class also contains several static methods:
- Escape: Escapes regex metacharacters within a string
- IsMatch: Returns a Boolean indicating whether the supplied regular expression matches within the string
- Match: Returns the first match as a Match instance
- Matches: Returns all matches as a MatchCollection
- Replace: Replaces the substrings matched by the regular expression with a replacement string
- Split: Returns an array of strings obtained by splitting the input at the positions matched by the expression
- Unescape: Unescapes any escaped characters within a string
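As a rough illustration of the less obvious static methods in the list above, the sketch below uses Escape, Unescape and Split. The sample strings and the module name are assumptions made for the example.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StaticMethodsSketch
    Sub Main()
        ' Escape turns a literal string into a pattern that matches it verbatim.
        Dim pattern As String = Regex.Escape("1+1=2")
        Console.WriteLine(pattern)                                    ' 1\+1=2
        Console.WriteLine(Regex.IsMatch("so 1+1=2 holds", pattern))   ' True

        ' Unescape reverses the escaping.
        Console.WriteLine(Regex.Unescape(pattern))                    ' 1+1=2

        ' Split cuts the input wherever the expression matches.
        Dim parts As String() = Regex.Split("a1b22c333d", "\d+")
        Console.WriteLine(String.Join(",", parts))                    ' a,b,c,d
    End Sub
End Module
```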
Example.
Imports System.Text.RegularExpressions
Imports System

Namespace RegularExpressionsSample
    Class Program
        Shared Sub Main(ByVal args As String())
            Dim regularExpression As String = "s[ia]mple"
            Dim inputString As String = "Is it simple sample?"
            ' Find the first match, then ask it for the next one.
            Dim m As Match = Regex.Match(inputString, regularExpression)
            Console.WriteLine("Match = " + m.ToString)
            Console.WriteLine("Next match = " + m.NextMatch.ToString)
        End Sub
    End Class
End Namespace
Output:
Match = simple
Next match = sample
Regular Expression Options:
Regular Expression Options can be used in the constructor for the Regex class.
- RegexOptions.None - Specifies that no options are set.
- RegexOptions.IgnoreCase - Specifies case-insensitive matching.
- RegexOptions.Multiline - Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
- RegexOptions.Singleline - Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).
- RegexOptions.ExplicitCapture - Specifies that the only valid captures are groups that are explicitly named or in the form (?<name>...).
- RegexOptions.IgnorePatternWhitespace - Eliminates unescaped white space from the pattern and enables comments marked with the hash sign (#).
- RegexOptions.Compiled - Specifies that the regular expression is compiled to an assembly. Matching will be faster, but the initial compilation takes more time. This option (although tempting) should only be used when the expression will be used many times, for example inside a For Each loop.
- RegexOptions.ECMAScript - Enables ECMAScript-compliant behavior for the expression. This flag can be used only in conjunction with the IgnoreCase, Multiline, and Compiled flags. The use of this flag with any other flags results in an exception.
- RegexOptions.RightToLeft - Specifies that the search will be from right to left instead of from left to right.
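To make the effect of the options concrete, here is a small sketch that passes them to the Regex constructor and combines two of them with Or. The input text and the module name are invented for the illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module OptionsSketch
    Sub Main()
        ' Two lines of text: the first ends in "line", the second in "LINE".
        Dim input As String = "First line" & vbLf & "second LINE"

        ' No options: case-sensitive, and $ matches only at the end of the input.
        Dim plain As New Regex("line$")
        Console.WriteLine(plain.Matches(input).Count)        ' 0

        ' IgnoreCase: "LINE" at the very end now matches.
        Dim ignoreCase As New Regex("line$", RegexOptions.IgnoreCase)
        Console.WriteLine(ignoreCase.Matches(input).Count)   ' 1

        ' IgnoreCase Or Multiline: $ also matches at the end of the first line.
        Dim both As New Regex("line$", RegexOptions.IgnoreCase Or RegexOptions.Multiline)
        Console.WriteLine(both.Matches(input).Count)         ' 2
    End Sub
End Module
```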
Basics of regular expression syntax
I will not try to write a full reference for every symbol used in regular expression patterns. MSDN exists for that purpose. Here we cover only the basic metacharacters. From here on, double quotes mark the text that a regular expression matches, and single quotes mark regular expression syntax.
Metacharacters that you want to use as ordinary characters must be escaped with the escape character \. (In C++ and C# string literals the backslash itself must additionally be escaped; in other languages this can differ, and in VB it is not necessary.) That is, to find "[", you must write '\['. The character \ means that the character following it is a special character, a constant and so on. For example, 'n' matches the letter "n", while '\n' matches a newline character. The sequence '\\' matches "\", and '\(' matches "(".
Classes of symbols (Character class)
Using square brackets, you can specify a group of characters (called a character class) to search for. For example, the pattern 'b[ai]rge' matches the words "barge" and "birge", that is, words beginning with "b", followed by "a" or "i", and ending in "rge". The reverse is also possible: you can specify characters that the matched substring must not contain. So '[^1-6]' matches any character except the digits 1 through 6. Note that inside a character class '\b' denotes the backspace character.
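A short sketch of both kinds of character class follows. The input strings are made up for the example; the patterns are the ones from the paragraph above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharacterClassSketch
    Sub Main()
        ' Positive class: "a" or "i" in the second position.
        For Each m As Match In Regex.Matches("barge birge berge", "b[ai]rge")
            Console.WriteLine(m.Value)                         ' barge, then birge
        Next

        ' Negated class: delete every character that is not a digit from 1 to 6.
        Console.WriteLine(Regex.Replace("a1b7c3", "[^1-6]", ""))   ' 13
    End Sub
End Module
```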
Quantifiers
If you do not know how many characters the required substring contains, you can use special symbols with the odd name of quantifiers. For example, you can write 'hel+o', which means a word beginning with "he", followed by one or more "l", and ending in "o". Keep in mind that a quantifier applies to the preceding expression, not necessarily to a single character.
| Symbol | Overview |
| * | Matches 0 or more occurrences of the preceding expression. For example, 'zo*' matches "z" and "zoo". |
| + | Matches 1 or more occurrences of the preceding expression. For example, 'zo+' matches "zo" and "zoo", but not "z". |
| ? | Matches 0 or 1 occurrences of the preceding expression. For example, 'do(es)?' matches "do" and "does". |
| *? | Matches 0 or more occurrences of the preceding expression, using as few characters as possible. For example, 'dog*?' matches "do" in "dog" or "dogger". |
| +? | Matches 1 or more occurrences of the preceding expression, using as few characters as possible. For example, 'dog+?' matches "dog" in "dog" or "dogger". |
| {n} | n is a non-negative integer. Matches exactly n occurrences. For example, 'o{2}' does not match the "o" in "Bob", but matches the two "o" in "food". |
| {n,} | n is a non-negative integer. Matches at least n occurrences. For example, 'o{2,}' does not match the "o" in "Bob", but matches all the "o" in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'. |
| {n,m} | m and n are non-negative integers, where n <= m. Matches at least n and at most m occurrences. For example, 'o{1,3}' matches the first three "o" in "fooooood". 'o{0,1}' is equivalent to 'o?'. No space is allowed between the comma and the numbers. |
| | | When placed between two characters or groups, matches one or the other (this is called alternation, because it chooses between two alternatives). |
Table 1. Quantifiers.
An important feature of the quantifiers '*' and '+' is their greed: they match as much as they can, not as little as is needed.
Example.
Source string: "hello out there, how are you"
Regular expression: 'h.*o'
This means: find 'h', followed by any number of arbitrary characters, followed by 'o'. The intent was probably to find "hello", but the match will be "hello out there, how are yo", because the greedy expression looks not for the first "o" but for the last one. A quantifier can be cured of its greed by adding '?'. That is:
Source string: "hello out there, how are you"
Regular expression: 'h.*?o'
This finds "hello", as intended, because it looks for 'h' followed by any number of arbitrary characters up to the first 'o' it meets.
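The difference between the greedy and the lazy quantifier can be checked with a few lines of code. This is just a sketch built around the source string from the example; the module name is arbitrary.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedySketch
    Sub Main()
        Dim input As String = "hello out there, how are you"

        ' Greedy: .* runs to the end and backs off to the LAST "o".
        Console.WriteLine(Regex.Match(input, "h.*o").Value)    ' hello out there, how are yo

        ' Lazy: .*? stops at the FIRST "o" it can reach.
        Console.WriteLine(Regex.Match(input, "h.*?o").Value)   ' hello
    End Sub
End Module
```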
Beginnings and ends of lines
The beginning or the end of a line is tested with the metacharacters ^ and $. For example, '^thing' matches a line beginning with "thing", and 'thing$' matches a line ending in "thing". These symbols work on individual lines only when the Multiline option ('m') is switched on. With the option switched off, they match only the beginning and the end of the whole text. There is also the symbol \z, which matches only at the exact end of the string.
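The following sketch shows how the anchors behave with and without RegexOptions.Multiline. The two-line input and the module name are assumptions chosen for the illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AnchorSketch
    Sub Main()
        ' Two lines: "one thing" and "thing two".
        Dim input As String = "one thing" & vbLf & "thing two"

        ' Without Multiline, ^ and $ refer to the whole text only.
        Console.WriteLine(Regex.IsMatch(input, "^thing"))                           ' False
        Console.WriteLine(Regex.IsMatch(input, "thing$"))                           ' False

        ' With Multiline, they also match at the start and end of each line.
        Console.WriteLine(Regex.IsMatch(input, "^thing", RegexOptions.Multiline))   ' True
        Console.WriteLine(Regex.IsMatch(input, "thing$", RegexOptions.Multiline))   ' True

        ' \z always means the exact end of the string.
        Console.WriteLine(Regex.IsMatch(input, "two\z"))                            ' True
    End Sub
End Module
```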
Border of a word
Word boundaries are specified with the metacharacters '\b' (a word boundary) and '\B' (not a word boundary).
The regular expression 'out' matches not only "out" in "speak out loud", but also "out" in "please do not shout at me". To avoid this, you can put a word boundary marker in front of the pattern.
With the expression '\bout', only an "out" at the beginning of a word will be found. Remember that inside a character class '\b' denotes the backspace character.
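A quick way to see the effect of the boundary marker is to count matches. The sentence below is invented for the sketch.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WordBoundarySketch
    Sub Main()
        Dim input As String = "please do not shout, speak out loud"

        ' Plain 'out' is found inside "shout" as well.
        Console.WriteLine(Regex.Matches(input, "out").Count)       ' 2

        ' '\bout' requires a word boundary in front of "out".
        Console.WriteLine(Regex.Matches(input, "\bout").Count)     ' 1

        ' '\Bout' is the opposite: "out" must NOT start a word.
        Console.WriteLine(Regex.Matches(input, "\Bout").Count)     ' 1
    End Sub
End Module
```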
Other special characters
The table below lists other special characters:
| Symbol | Overview |
| \a | Matches a bell (alarm). |
| \b | Matches a backspace if in a [] character class; otherwise, see above. |
| \t | Matches a tab. |
| \r | Matches a carriage return. |
| \v | Matches a vertical tab. |
| \f | Matches a form feed. |
| \n | Matches a new line. |
| \e | Matches an escape. |
| \w | Matches a word character (a-z, A-Z, 0-9 and underscore). |
| \W | Matches any character that is not a word character. |
| \s | Matches any white-space character (space, tab, new line and so on). |
| \S | Matches any character that is not white space. |
| \d | Matches a digit (0-9). |
| \D | Matches any character that is not a digit. |
| . | Matches any character except a new line (\n); with the Singleline option it matches every character. |
| $ | Matches the end of the string or, in Multiline mode, the end of a line. |
Table 2. Other special symbols
Grouping and Backreferences
You can group patterns by placing them in parentheses. You can give a name to the group as well. Here are some of the grouping constructs you'll be using:
| Construction | Overview |
| () | Defines a simple group. |
| (?<name>) | Group named "name". |
| (?i:) | Ignore case when matching within the group. |
| \number | Matches a previous group (group number "number"). For example, (\w)\1 finds doubled word characters. |
| \k<name> | Matches a previous group with the specified name. For example, (?<char>\w)\k<char> finds doubled word characters. The expression (?<43>\w)\43 does the same. You can use single quotes instead of angle brackets; for example, \k'char'. |
Table 3. Grouping patterns
Groups that don't have a name have a number.
Example.
Imports System.Text.RegularExpressions
Imports System

Namespace RegularExpressionsSample
    Class Program
        Shared Sub Main(ByVal args As String())
            Dim regularExpression As String = "\$(\d+)\.(\d\d)"
            Dim inputString As String = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5"
            Dim m As Match = Regex.Match(inputString, regularExpression)
            While m.Success
                ' Group 0 is the whole match; groups 1 and 2 are the parenthesized parts.
                Dim gc As GroupCollection = m.Groups
                ' Use & for concatenation: "+" with an Integer would attempt numeric addition.
                Console.WriteLine("The number of captures: " & gc.Count)
                For i As Integer = 0 To gc.Count - 1
                    Dim g As Group = gc(i)
                    Console.WriteLine(g.Value)
                Next
                m = m.NextMatch
            End While
        End Sub
    End Class
End Namespace
Output:
The number of captures: 3
$1.57
1
57
The number of captures: 3
$316.15
316
15
The number of captures: 3
$19.30
19
30
The number of captures: 3
$0.30
0
30
The number of captures: 3
$0.00
0
00
The number of captures: 3
$41.10
41
10
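Named groups and the \k<name> backreference from Table 3 can be tried out in the same way. The sketch below is only an illustration; the group names "dollars" and "cents", the sample word and the module name are assumptions.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupSketch
    Sub Main()
        ' Named groups are read back by name instead of by number.
        Dim m As Match = Regex.Match("$316.15", "\$(?<dollars>\d+)\.(?<cents>\d\d)")
        Console.WriteLine(m.Groups("dollars").Value)   ' 316
        Console.WriteLine(m.Groups("cents").Value)     ' 15

        ' A named backreference finds doubled word characters.
        For Each d As Match In Regex.Matches("bookkeeper", "(?<char>\w)\k<char>")
            Console.WriteLine(d.Value)                 ' oo, then kk, then ee
        Next
    End Sub
End Module
```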
Replacement
Substitutions are allowed only within a replacement pattern. For similar functionality within a regular expression, use a backreference such as \1.
Character escapes and substitutions are the only special constructs recognized in a replacement pattern. All other syntactic constructs are allowed in regular expressions only and not recognized in replacement patterns. For example, the replacement pattern 'a*${test}b' inserts the string "a*" followed by the substring matched by the "test" capturing group, if any, followed by the string "b". The * character is not recognized as a metacharacter within a replacement pattern. Similarly, $-patterns are not recognized within a regular expression matching pattern. Within a regular expression, $ denotes the end of the string. Other examples are: '$123' substitutes the last substring matched by group number 123 (decimal), and ${name} substitutes the last substring matched by a (?<name>) group.
Example.
Formatting a string with the Replace method.
Imports System.Text.RegularExpressions
Imports System

Namespace RegularExpressionsSample
    Class Program
        Shared Sub Main(ByVal args As String())
            ' Groups: 1 = leading white space, 2 = variable name, 3 = type name.
            Dim regularExpression As String = "(\s*)Dim\s+(\w+)\s+As\s+(\w+)"
            Dim inputString As String = "Dim abc As Integer"
            Dim replacement As String = "$1$3 $2;"
            Console.WriteLine(Regex.Replace(inputString, regularExpression, replacement))
        End Sub
    End Class
End Namespace
Output:
Integer abc;
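The same replacement can be written with named groups and the ${name} substitution, and the MatchEvaluator delegate listed earlier covers cases where the replacement has to be computed. The following is a sketch only; the group names, the helper function Shout and the module name are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceSketch
    ' Hypothetical helper: computes the replacement text for each match.
    Function Shout(ByVal m As Match) As String
        Return m.Value.ToUpper()
    End Function

    Sub Main()
        ' ${name} inserts the text captured by the group with that name.
        Dim pattern As String = "Dim\s+(?<name>\w+)\s+As\s+(?<type>\w+)"
        Console.WriteLine(Regex.Replace("Dim abc As Integer", pattern, "${type} ${name};"))
        ' Integer abc;

        ' A MatchEvaluator is called once per match and returns its replacement.
        Dim evaluator As New MatchEvaluator(AddressOf Shout)
        Console.WriteLine(Regex.Replace("dim abc as integer", "\b(dim|as)\b", evaluator))
        ' DIM abc AS integer
    End Sub
End Module
```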
Lookaround
There are two directions of lookaround - lookahead and lookbehind - and two flavors of each direction - positive assertion and negative assertion. The syntax for each is:
- (?=...) - Positive lookahead
- (?!...) - Negative lookahead
- (?<=...) - Positive lookbehind
- (?<!...) - Negative lookbehind
Understanding look(ahead|behind) requires an understanding of the difference between matching text and matching position. To help with this understanding I should state first that lookaround assertions are non-consuming. To see what I mean, let's look at the following simple example.
regularExpression = "stop"
inputString = "stopping"
When the above pattern is applied to the text, the "context" of the parser ends up at the position between the two "p" characters in the word "stopping". This is because the regular expression parser bumps along the string as it gets a match, like so:
- Start - ^stopping
- Match "s" - s^topping
- Match "t" - st^opping
- Match "o" - sto^pping
- Match "p" - stop^ping
Once the parser has consumed text, that text is no longer available to the rest of the pattern. To understand where this causes difficulty, suppose you need to match the word "stop", but only when it is contained in the word "stopped" and not in any other combination such as "stopper". With lookahead you can simply assert that condition like so: (?=stopped\b)stop
This works because, with lookaround, the parser is not bumped along the string. This can be especially useful for finding a position in a document by combining a lookahead assertion with a lookbehind assertion. To demonstrate, suppose we need to match the string "stop" when it is contained within the string "estopped" but not "astopped". To do this you can use a negative lookbehind assertion on "a" and a positive lookahead assertion on "stopped", like this: (?<!a)(?=stopped\b)stop
In other words, you first match a position at which to start matching text. Applied to the string "estopped", the above pattern moves the parser as follows:
- Start - e^stopped
- Match "s" - es^topped
- Match "t" - est^opped
- Match "o" - esto^pped
- Match "p" - estop^ped
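The combined lookbehind and lookahead pattern can be verified with a short sketch. The test sentence and the module name are made up; the pattern is the one given above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaroundSketch
    Sub Main()
        Dim pattern As String = "(?<!a)(?=stopped\b)stop"
        Dim input As String = "estopped astopped stopper"

        ' Only the "stop" inside "estopped" satisfies both assertions.
        For Each m As Match In Regex.Matches(input, pattern)
            Console.WriteLine(m.Index.ToString() & ": " & m.Value)   ' 1: stop
        Next

        ' The assertions consume nothing, so the match itself is just "stop".
        Console.WriteLine(Regex.Matches(input, pattern).Count)       ' 1
    End Sub
End Module
```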
Example.
An example of using lookaround is validating "special" password conditions such as: "Password must be between 8 and 20 characters, must contain at least 2 letter characters and at least 2 digit characters. It can only contain either letter or digit characters."
For such a password constraint, the following expression would probably do quite nicely: ^(?=.*?\d.*?\d)(?=.*?[a-zA-Z].*?[a-zA-Z])[a-zA-Z\d]{8,20}$
The letter checks use [a-zA-Z] rather than \w, because \w also matches digits and the underscore and so would not enforce the "at least 2 letters" rule.
Imports System.Text.RegularExpressions
Imports System

Namespace RegularExpressionsSample
    Class Program
        Shared Sub Main(ByVal args As String())
            ' Two lookaheads check the digit and letter counts; the body checks length and alphabet.
            Dim regularExpression As String = "^(?=.*?\d.*?\d)(?=.*?[a-zA-Z].*?[a-zA-Z])[a-zA-Z\d]{8,20}$"
            Console.WriteLine("Please input password for check:")
            Dim inputString As String = Console.ReadLine()
            If Not String.IsNullOrEmpty(inputString) AndAlso Regex.IsMatch(inputString, regularExpression) Then
                Console.WriteLine("It's correct security password")
            Else
                Console.WriteLine("It's incorrect password.")
            End If
            Console.Read()
        End Sub
    End Class
End Namespace
Output:
Please input password for check:
abc4D5678
It's correct security password
Commonly used regular expression patterns
| Pattern | Description |
| ^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$ | Matches email addresses and checks that they are of the proper form. It checks that the top-level domain is between 2 and 4 characters long. |
| ^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$ | Matches phone numbers, allowing for an international dialing code at the start and for the hyphens and spaces that are sometimes entered. |
| ^\d{1,2}\/\d{1,2}\/\d{4}$ | Matches dates of the form XX/XX/YYYY, where XX can be 1 or 2 digits long and YYYY is always 4 digits long. |
| ^([0-1][0-9]|[2][0-3]):([0-5][0-9])$ | Matches a time in the format HH:MM. |
| \b(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])\b | Matches decimal IP addresses. |
| ^\d{5}$|^\d{5}-\d{4}$ | Matches either a 5-digit ZIP code or a ZIP+4 code formatted as 5 digits, a hyphen, and another 4 digits. |
| ^\d{3}-\d{2}-\d{4}$ | Matches a hyphen-separated Social Security Number (SSN) in the format NNN-NN-NNNN. |
| ^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$ | Matches major credit cards including: Visa (length 16, prefix 4), Mastercard (length 16, prefix 51-55), Discover (length 16, prefix 6011), American Express (length 15, prefix 34 or 37). All 16-digit formats accept optional hyphens (-) between each group of four digits. |
| (http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z 0-9\-\._\?\,\'/\\\+&%\$#\=~])* | Matches some URLs. |
Conclusion
This is only a brief introduction to regular expressions and their use. If you want to understand them better, practice writing regular expressions yourself. Experience shows that studying other people's regular expressions is of little use, because they are almost impossible to read. Still, learning to use them is worthwhile: they often make life simpler.
NOTE: THIS ARTICLE IS CONVERTED FROM C# TO VB.NET USING A CONVERSION TOOL. ORIGINAL ARTICLE CAN BE FOUND ON C# CORNER (http://www.c-sharpcorner.com/).