Example - URL Extractor
Broken links are a thorn in every site owner's side. Suppose your boss asked
you to develop a tool to identify links on the company website, so that they
could be fed to a validation engine. Having studied the System.IO namespace,
you're familiar with opening text files and loading their contents into memory.
URLs can be constructed in a variety of ways, but for this example, we'll
consider the standard <a> tag, with its href attribute. Sample links
that we'll need to support look like the following:
<a href="http://www.dotnetcoders.com">DotnetCoders.com Website</a>
<a href = "http://www.dotnetcoders.com">DotnetCoders.com
Website</a>
<a href=http://www.dotnetcoders.com>DotnetCoders.com Website</a>
Therefore, the task of identifying all of the URLs in an HTML file involves
extracting the value of every href attribute. Having read through this guide to
Regular Expressions, you instantly realize that regular expressions can help
you identify and extract this pattern.
Starting the Pattern
First, we notice that every attribute begins with href, so that will be the
start of our pattern:
href
Next, an equals (=) sign. In HTML, this does not have to immediately follow the
href text, but can be separated by zero or more whitespace characters.
Therefore, we add \s* to indicate zero or more whitespace characters, followed
by an equals sign:
href\s*=
In HTML, any number of whitespace characters can also come in between the
equals sign and the opening quote or beginning of the URL. Again, we'll use \s*
to indicate zero or more whitespace characters:
href\s*=\s*
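To see what this prefix accepts, here is a minimal sketch that runs it against the opening tags of the three sample links. The class name HrefPrefixDemo and the helper MatchPrefix are hypothetical names chosen for illustration; only Regex and Match come from System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefPrefixDemo
{
    // The pattern built so far: "href", optional whitespace, "=", optional whitespace.
    public static readonly Regex Prefix = new Regex(@"href\s*=\s*");

    // Returns the text matched by the prefix pattern, or an empty string
    // when the input contains no match.
    public static string MatchPrefix(string html)
    {
        return Prefix.Match(html).Value;
    }

    public static void Main()
    {
        string[] samples =
        {
            "<a href=\"http://www.dotnetcoders.com\">",
            "<a href = \"http://www.dotnetcoders.com\">",
            "<a href=http://www.dotnetcoders.com>"
        };
        foreach (string sample in samples)
        {
            // Brackets make any matched whitespace visible in the output.
            Console.WriteLine("[{0}]", MatchPrefix(sample));
        }
    }
}
```

The second sample is the interesting one: the match includes the spaces on both sides of the equals sign, so whatever follows the prefix starts at the opening quote or at the first character of the URL.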
URLs
At this point, URLs can take one of two forms. They can either be enclosed in
double quotes, as in the example href="http://www.dotnetcoders.com", or the URL
can be entered without the quotes, in which case the next whitespace character,
or the > that closes the tag, indicates the end of the URL. We'll consider each
of these cases individually.
First, the case where the URL is enclosed in quotes. Our pattern thus starts
out as a sequence of characters surrounded by double quotes:
\"...url...\"
Several characters can be used to make up a URL, including letters, numbers,
slashes, colons, question marks, and ampersands. Rather than indicate every
possible acceptable character, we can utilize our knowledge of an unacceptable
character that marks the end of the URL, the closing quote (") character. Our
URL then becomes a sequence of zero or more characters, as long as each
character is not a quote. We'll use a named group so that we can extract the
URL from the match, without the double quotes:
\"(?<url>[^\"]*)\"
Adding this to our base pattern results in the following, which successfully
matches all cases of href attributes with their values enclosed in double
quotes.
href\s*=\s*\"(?<url>[^\"]*)\"
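As a quick check of the quoted case, the sketch below applies this pattern and reads the URL back through the named group. QuotedHrefDemo and ExtractUrl are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedHrefDemo
{
    // Regex text: href\s*=\s*"(?<url>[^"]*)"
    // In a C# string literal each backslash and each quote is escaped.
    public static readonly Regex QuotedHref =
        new Regex("href\\s*=\\s*\"(?<url>[^\"]*)\"");

    // Returns the value of the "url" group, or an empty string when nothing matches.
    public static string ExtractUrl(string html)
    {
        return QuotedHref.Match(html).Groups["url"].Value;
    }

    public static void Main()
    {
        // Both quoted forms yield the URL without its surrounding quotes.
        Console.WriteLine(ExtractUrl("<a href=\"http://www.dotnetcoders.com\">DotnetCoders.com Website</a>"));
        Console.WriteLine(ExtractUrl("<a href = \"http://www.dotnetcoders.com\">DotnetCoders.com Website</a>"));
        // The unquoted form is not handled yet, so this prints an empty line.
        Console.WriteLine(ExtractUrl("<a href=http://www.dotnetcoders.com>DotnetCoders.com Website</a>"));
    }
}
```

Reading Groups["url"] rather than the whole match is what drops the href= prefix and the quotes from the result.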
We still have the case of URLs not enclosed in double quotes. Again, rather
than indicate all acceptable characters, we leverage the fact that a whitespace
character, or the > that closes the tag, indicates the end of the URL. The
pattern for this looks similar to the previous URL pattern, except that here we
indicate that the URL is a sequence of characters that are neither whitespace
nor >. We also give this pattern a group name so we can extract the URL:
(?<url>[^\s>]*)
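The choice of terminating characters matters for the unquoted sample link, where the URL is followed directly by the tag's closing >. The sketch below compares a character class that stops only at whitespace with one that also stops at >. Treating > as a terminator is an assumption made for this illustration, and UnquotedHrefDemo and ExtractUrl are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnquotedHrefDemo
{
    // Stops only at whitespace.
    public static readonly Regex StopAtWhitespace =
        new Regex(@"href\s*=\s*(?<url>[^\s]*)");

    // Stops at whitespace or at the '>' that closes the tag.
    public static readonly Regex StopAtWhitespaceOrBracket =
        new Regex(@"href\s*=\s*(?<url>[^\s>]*)");

    // Returns the value of the "url" group for the first match of the given pattern.
    public static string ExtractUrl(Regex pattern, string html)
    {
        return pattern.Match(html).Groups["url"].Value;
    }

    public static void Main()
    {
        string sample = "<a href=http://www.dotnetcoders.com>DotnetCoders.com Website</a>";
        // prints: http://www.dotnetcoders.com>DotnetCoders.com
        Console.WriteLine(ExtractUrl(StopAtWhitespace, sample));
        // prints: http://www.dotnetcoders.com
        Console.WriteLine(ExtractUrl(StopAtWhitespaceOrBracket, sample));
    }
}
```

With whitespace as the only terminator, the capture runs past the end of the tag and into the link text, up to the first space.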
Combining the Patterns
Because we want to include both cases, we use the OR specifier (|) to indicate
that the URL will either be enclosed in quotes, or it will not. We need to use
a non-capturing group, (?: ), to encompass our OR clause. Because both
alternatives use the same group name, the regex engine stores the capture from
whichever alternative matched under the single name url. The unquoted
alternative stops at whitespace or at the > that closes the tag. Our final
regular expression looks like this:
href\s*=\s*(?:(?:\"(?<url>[^\"]*)\")|(?<url>[^\s>]*))
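Before pointing the pattern at a live page, it can be exercised offline against the three sample links. The sketch below is one way to do that; HrefExtractorDemo and ExtractUrls are hypothetical names, and the unquoted alternative is written as [^\s>]* on the assumption that the tag's closing > should end the URL as well as whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractorDemo
{
    // Regex text: href\s*=\s*(?:(?:"(?<url>[^"]*)")|(?<url>[^\s>]*))
    // Both alternatives capture into the same named group, "url".
    public static readonly Regex Href = new Regex(
        "href\\s*=\\s*(?:(?:\"(?<url>[^\"]*)\")|(?<url>[^\\s>]*))");

    // Returns the "url" group of every match, in document order.
    public static List<string> ExtractUrls(string html)
    {
        List<string> urls = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // The three sample links from the start of this example.
        string html =
            "<a href=\"http://www.dotnetcoders.com\">DotnetCoders.com Website</a>\n" +
            "<a href = \"http://www.dotnetcoders.com\">DotnetCoders.com\nWebsite</a>\n" +
            "<a href=http://www.dotnetcoders.com>DotnetCoders.com Website</a>\n";

        // Prints http://www.dotnetcoders.com three times, once per link.
        foreach (string url in ExtractUrls(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Reusing one group name in both alternatives is a .NET regex feature; many other regex flavors reject duplicate group names, so the pattern may need separate names elsewhere.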
Testing the Pattern
The following sample code, shown first in C# and then in VB.NET, downloads the
W3C homepage and parses it for links.
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class TestHarness
{
    public static void Main()
    {
        // Download the page and read the whole response body into a string.
        HttpWebRequest request = (HttpWebRequest) WebRequest.Create("http://www.w3c.org/");
        string strContents;
        using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            strContents = reader.ReadToEnd();
        }

        // Quoted value, or an unquoted value that ends at whitespace or '>'.
        Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s>]*))");
        MatchCollection mc1 = r.Matches(strContents);
        Console.WriteLine(r.ToString());

        foreach (Match m1 in mc1)
        {
            Console.WriteLine("Match: {0}", m1.Value);
            // Read the named group directly; Groups[0] is the entire match, not the URL.
            Console.WriteLine("URL: {0}", m1.Groups["url"].Value);
        }
    }
}
Imports System
Imports System.IO
Imports System.Net
Imports System.Text.RegularExpressions

Public Module TestHarness
    Public Sub Main()
        ' Download the page and read the whole response body into a string.
        Dim request As HttpWebRequest = CType(WebRequest.Create("http://www.w3c.org/"), HttpWebRequest)
        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Dim s As Stream = response.GetResponseStream()
        Dim strContents As String = New StreamReader(s).ReadToEnd()
        response.Close()

        ' In a VB string literal, "" stands for one double-quote character.
        Dim r As New Regex("href\s*=\s*(?:(?:""(?<url>[^""]*)"")|(?<url>[^\s>]*))")
        Dim mc1 As MatchCollection = r.Matches(strContents)
        Console.WriteLine(r.ToString())

        For Each m1 As Match In mc1
            Console.WriteLine("Match: {0}", m1.Value)
            ' Read the named group directly; Groups(0) is the entire match, not the URL.
            Console.WriteLine("URL: {0}", m1.Groups("url").Value)
        Next
    End Sub
End Module