Groups
Groups are user-defined subsets of the Regular Expression pattern, and are
used when processing a match to identify subsets of the matching string. You can think
of Groups as Sub-matches.
While extracting U.S. phone numbers with a regular expression that matches the
whole phone number, you may want to identify the three-digit area code. Groups
allow you to extract that subset from a match. Later on, we'll look at an
implementation for doing just that.
Syntax of Grouping Constructs
The syntax for indicating a group within a regular expression is to enclose the
subpattern within parentheses: (). When indicating a group, you can specify whether
the group should be retrievable through the Groups property, and the name of the
group. As a result, there are three variations on the Group syntax to handle these cases.
Normal Capturing Groups - ()
This syntax tells the regular expression engine to capture the group so that it can be retrieved from the Groups property
of the Match by a numeric index. Note that if the RegexOptions.ExplicitCapture option is set, only
named groups are captured, and unnamed groups are not part of the final Match.
Consider the following string of characters:
K9 DG OK D1
If we use this regular expression, which matches an uppercase letter followed
by a digit (notice the letter in parentheses)...
([A-Z])\d
...it will return the following matches:
Match 1: K9
Match 2: D1
Each Match also has a group, because we used a capturing group around the
uppercase letter:
Match 1 Group: K
Match 2 Group: D
We'll see later how to programmatically access the group via the Groups property of the Match object.
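As a quick preview, the example above can be sketched in C# roughly as follows. The class and method names (CapturingGroupDemo, CapturedLetters, GroupCount) are made up for illustration; the sketch assumes only the standard System.Text.RegularExpressions types. It also shows the effect of RegexOptions.ExplicitCapture on an unnamed group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CapturingGroupDemo
{
    // Returns the text captured by group 1 (the uppercase letter) for every match.
    public static List<string> CapturedLetters(string input)
    {
        var letters = new List<string>();
        foreach (Match m in Regex.Matches(input, @"([A-Z])\d"))
        {
            // Groups[0] is the whole match; Groups[1] is the first parenthesized group.
            letters.Add(m.Groups[1].Value);
        }
        return letters;
    }

    // Returns how many groups the first match exposes under the given options.
    public static int GroupCount(string input, RegexOptions options)
    {
        Match m = Regex.Match(input, @"([A-Z])\d", options);
        return m.Groups.Count;
    }

    public static void Main()
    {
        foreach (string letter in CapturedLetters("K9 DG OK D1"))
        {
            Console.WriteLine("Group: " + letter); // prints K, then D
        }

        // With ExplicitCapture the unnamed group is not captured,
        // so only the whole-match group remains.
        Console.WriteLine(GroupCount("K9 DG OK D1", RegexOptions.None));            // 2
        Console.WriteLine(GroupCount("K9 DG OK D1", RegexOptions.ExplicitCapture)); // 1
    }
}
```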
Named Capturing Groups - (?<name>)
Named capturing groups are an extension of normal capturing groups and allow us to specify
a name for the group. This makes the regular expression more understandable, and the Group can
later be extracted from the Match by name as well, making for more readable and less fragile code.
Extending our earlier example, we could explicitly name the group "letter" using
the following regular expression:
(?<letter>[A-Z])\d
This will result in the same matches being found, and the same groups, except
now we can access the group by name and not just by numeric index. The
following snippets come from code that accesses the group by name, first in C# and then in VB.NET.
// C#
objGroupsCollection["letter"].Value;
' VB.NET
objGroupsCollection("letter").Value
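For a fuller preview, here is a minimal, self-contained C# sketch of reading a named group. It assumes the collection in the snippet above is the Groups property of a Match; the class and method names (NamedGroupDemo, FirstLetter) are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Returns the value of the group named "letter" from the first match,
    // or an empty string when the input contains no match.
    public static string FirstLetter(string input)
    {
        Match m = Regex.Match(input, @"(?<letter>[A-Z])\d");
        return m.Success ? m.Groups["letter"].Value : string.Empty;
    }

    public static void Main()
    {
        Match m = Regex.Match("K9 DG OK D1", @"(?<letter>[A-Z])\d");

        // A named group can be read by name or by its number.
        Console.WriteLine(m.Groups["letter"].Value); // K
        Console.WriteLine(m.Groups[1].Value);        // K
    }
}
```

Reading by name keeps working if another group is later added in front of it, whereas a numeric index would shift.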
We'll see how to use the Group object shortly, after we look at the non-capturing
group.
Non-Capturing Groups - (?:)
Non-capturing groups are used to instruct the regex parser to treat the subpattern as a group, but
not to capture the results as a Group.
Using our earlier example, we can make our capturing group a non-capturing
group by adding a question mark (?) and a colon (:) immediately after the opening parenthesis.
(?:[A-Z])\d
This will find the same Matches, however no groups will be captured. You might
be asking, what is the purpose of grouping a subexpression if it's not going to
be used? Where non-capturing groups become useful is when you are using the OR
(|) construct within the regular expression. Look at the following pattern,
where we want to limit our matches to those that begin with A, B, or C, and are
followed by a digit:
A|B|C\d
What matches do you expect when the above pattern is run against the following
input string?
K9 CC C3 A1
The matches returned are:
Match 1: C3
Match 2: A
Why did the second match return A, and not A1? The answer is in the order of operations. Alternation (|) has
the lowest precedence of any regular expression operator, so the expression is read as (parentheses used for clarification):
Match an (A) OR a (B) OR a (C followed by a digit)
Therefore, in order to properly group the letters, we need to use a grouping
construct. If we didn't need the group in the output, then we could use a
non-capturing group and save some processing. The following corrected
expression will return the two expected matches, C3 and A1.
(?:A|B|C)\d
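The difference between the two patterns can be checked with a short C# sketch. The helper name MatchValues and the class name AlternationDemo are assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Collects the text of every match of the pattern in the input.
    public static List<string> MatchValues(string pattern, string input)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            values.Add(m.Value);
        }
        return values;
    }

    public static void Main()
    {
        const string input = "K9 CC C3 A1";

        // Without grouping, \d binds only to the C alternative.
        Console.WriteLine(string.Join(", ", MatchValues(@"A|B|C\d", input)));     // C3, A

        // The non-capturing group applies \d to all three alternatives.
        Console.WriteLine(string.Join(", ", MatchValues(@"(?:A|B|C)\d", input))); // C3, A1

        // Only the whole-match group exists, because (?:...) captures nothing.
        Console.WriteLine(Regex.Match(input, @"(?:A|B|C)\d").Groups.Count);       // 1
    }
}
```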
The GroupCollection and Group objects
Now that you have learned about the grouping syntax, I'm sure you are eager to see how to programmatically
retrieve groups. Each Match object has a Groups property. This returns a GroupCollection containing a
series of Group objects.
// C#
Regex r = new Regex(@"([A-Z])\d");
Match m = r.Match("K9 DG OK D1");
GroupCollection gc = m.Groups;
' VB.NET
Dim r As Regex = New Regex("([A-Z])\d")
Dim m As Match = r.Match("K9 DG OK D1")
Dim gc As GroupCollection = m.Groups
Because the GroupCollection, like the MatchCollection, implements ICollection
and IEnumerable, as well as an Indexer, you can access the Group objects using
the foreach syntax, or using the array syntax. To output the first match, you
could simply write:
// C#
Console.WriteLine("Match 1:" + gc[0].Value);
' VB.NET
Console.WriteLine("Match 1:" + gc(0).Value)
As mentioned, the foreach syntax can be used to iterate through all the Group
objects in the GroupCollection:
// C#
foreach (Group g in gc)
{
    Console.WriteLine("Group:" + g.Value);
}
' VB.NET
For Each g As Group In gc
    Console.WriteLine("Group:" + g.Value)
Next
One behavior of groups that you need to watch out for is that the first Group
in a GroupCollection will be the entire Match string itself.
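One way to deal with this, sketched below in C#, is to start iterating at index 1 when only the sub-groups are wanted. The pattern here adds a second group around the digit purely for illustration, and the names SkipGroupZeroDemo and SubGroups are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SkipGroupZeroDemo
{
    // Returns the captured sub-groups, skipping Groups[0] (the entire match).
    public static List<string> SubGroups(Match m)
    {
        var result = new List<string>();
        for (int i = 1; i < m.Groups.Count; i++)
        {
            result.Add(m.Groups[i].Value);
        }
        return result;
    }

    public static void Main()
    {
        Match m = Regex.Match("K9 DG OK D1", @"([A-Z])(\d)");

        // Groups[0] always mirrors the match itself.
        Console.WriteLine(m.Groups[0].Value == m.Value); // True

        foreach (string value in SubGroups(m))
        {
            Console.WriteLine("Group: " + value); // prints K, then 9
        }
    }
}
```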
Example
Having covered the purpose and syntax of groups, let's look at a practical
example. In our previous sections, we have used the following pattern to
identify phone numbers in a string:
\d\d\d-\d\d\d-\d\d\d\d
Now, what if we wanted to extract the area code from the match? We could parse
the entire match for the first 3 characters, but parsing wouldn't be a flexible
solution, and would definitely be cumbersome for more complicated regular
expressions. Instead, we can use the grouping construct, a pair of parentheses
(), around the portion we are interested in. Our new regular expression
becomes:
(\d\d\d)-\d\d\d-\d\d\d\d
// C#
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        Match m1 = Regex.Match("Our phone number is 508-888-8888.", @"(\d\d\d)-\d\d\d-\d\d\d\d");
        if (m1.Success)
        {
            Console.WriteLine("The value '{0}' was found at index {1}, and is {2} characters long.", m1.Value, m1.Index, m1.Length);
            // The first Group is the entire match; the second is the area code.
            foreach (Group g in m1.Groups)
            {
                Console.WriteLine("A Group, '{0}', was found at index {1}, and is {2} characters long.", g.Value, g.Index, g.Length);
            }
        }
    }
}
' VB.NET
Imports System
Imports System.Text.RegularExpressions

Module Program
    Public Sub Main()
        Dim m1 As Match = Regex.Match("Our phone number is 508-888-8888.", "(\d\d\d)-\d\d\d-\d\d\d\d")
        If m1.Success Then
            Console.WriteLine("The value '{0}' was found at index {1}, and is {2} characters long.", m1.Value, m1.Index, m1.Length)
            ' The first Group is the entire match; the second is the area code.
            For Each g As Group In m1.Groups
                Console.WriteLine("A Group, '{0}', was found at index {1}, and is {2} characters long.", g.Value, g.Index, g.Length)
            Next
        End If
    End Sub
End Module
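The results, which also show that the first group is the match itself:
The value '508-888-8888' was found at index 20, and is 12 characters long.
A Group, '508-888-8888', was found at index 20, and is 12 characters long.
A Group, '508', was found at index 20, and is 3 characters long.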
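If only the area code is needed, a named group avoids looping over the collection. The following C# sketch is one possible variation, not part of the example above: the group name "area", the \d{3} shorthand for \d\d\d, and the names AreaCodeDemo and TryGetAreaCode are all assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AreaCodeDemo
{
    // Tries to find a phone number of the form ddd-ddd-dddd and
    // returns its first three digits through the out parameter.
    public static bool TryGetAreaCode(string text, out string areaCode)
    {
        Match m = Regex.Match(text, @"(?<area>\d{3})-\d{3}-\d{4}");
        areaCode = m.Success ? m.Groups["area"].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        if (TryGetAreaCode("Our phone number is 508-888-8888.", out string areaCode))
        {
            Console.WriteLine("Area code: " + areaCode); // Area code: 508
        }
    }
}
```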