
Example - URL Extractor

Broken links are a thorn in every site owner's side. Suppose your boss asked you to develop a tool to identify links on the company website, so that they could be fed to a validation engine. Having studied the System.IO namespace, you're familiar with opening text files and loading their contents into memory.

URLs can be constructed in a variety of ways, but for this example, we'll consider only the standard <a> tag, with its href="" attribute. Sample links that we'll need to support look like the following:

<a href="http://www.dotnetcoders.com">DotnetCoders.com Website</a>
<a href = "http://www.dotnetcoders.com">DotnetCoders.com Website</a>
<a href=http://www.dotnetcoders.com>DotnetCoders.com Website</a>
Therefore, the task of identifying all of the URLs in an HTML file involves extracting the value of every href attribute. Having read through this guide to Regular Expressions, you instantly realize that regular expressions can help you identify and extract this pattern.
Starting the Pattern
First, we notice that every attribute begins with href, so that will be the start of our pattern:

href
Next comes an equals sign (=). In HTML, it does not have to immediately follow the href text; the two can be separated by zero or more whitespace characters. Therefore, we add \s* to indicate zero or more whitespace characters, followed by an equals sign:
href\s*=
In HTML, any number of whitespace characters can also come in between the equals sign and the opening quote or beginning of the URL. Again, we'll use \s* to indicate zero or more whitespace characters:
href\s*=\s*
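As a quick check of the prefix built so far, the following minimal sketch runs href\s*=\s* against the opening tags of the three sample links. The class name PrefixDemo is only an illustrative name and is not part of the original example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixDemo
{
    // Pattern so far: "href", optional whitespace, "=", optional whitespace.
    public static readonly Regex Prefix = new Regex(@"href\s*=\s*");

    public static void Main()
    {
        // Opening tags of the three sample links from the article.
        string[] samples =
        {
            "<a href=\"http://www.dotnetcoders.com\">",
            "<a href = \"http://www.dotnetcoders.com\">",
            "<a href=http://www.dotnetcoders.com>"
        };

        foreach (string sample in samples)
        {
            Match m = Prefix.Match(sample);
            // Brackets make any matched whitespace visible in the output.
            Console.WriteLine("{0}: [{1}]", m.Success, m.Value);
        }
    }
}
```

All three samples should match. Only the second match includes the spaces around the equals sign.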
URLs
At this point, URLs can take one of two forms. They can either be enclosed in double quotes, as in the example href="http://www.dotnetcoders.com", or the URL can be entered without the quotes, in which case the next whitespace character indicates the end of the URL. We'll consider each of these cases individually.
First, consider the case where the URL is enclosed in quotes. Our pattern thus starts out as a sequence of characters surrounded by double quotes:
\"...url...\"
Several characters can be used to make up a URL, including letters, numbers, slashes, colons, question marks, and ampersands. Rather than indicate every possible acceptable character, we can utilize our knowledge of an unacceptable character that marks the end of the URL, the closing quote (") character. Our URL then becomes a sequence of zero or more characters, as long as each character is not a quote. We'll use a named group so that we can extract the URL from the match, without the double quotes:
\"(?<url>[^\"]*)\"
Adding this to our base pattern results in the following, which successfully matches all cases of href attributes with their values enclosed in double quotes.
href\s*=\s*\"(?<url>[^\"]*)\"
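To see the named group at work, here is a small sketch that applies the quoted pattern and reads the URL back through Groups["url"]. The class and method names (QuotedHrefDemo, ExtractUrl) are made up for illustration. The pattern is written as a C# verbatim string, where a doubled "" stands for a single quote character, so the regex escapes on the quotes are dropped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedHrefDemo
{
    // Same pattern as above: href, "=", then a double-quoted value
    // captured into the named group "url" without its quotes.
    public static readonly Regex QuotedHref =
        new Regex(@"href\s*=\s*""(?<url>[^""]*)""");

    // Returns the first quoted href value, or null when there is none.
    public static string ExtractUrl(string html)
    {
        Match m = QuotedHref.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractUrl("<a href=\"http://www.dotnetcoders.com\">DotnetCoders.com Website</a>"));
        Console.WriteLine(ExtractUrl("<a href = \"http://www.dotnetcoders.com\">DotnetCoders.com Website</a>"));
        // The unquoted form is not covered yet, so this call returns null.
        Console.WriteLine(ExtractUrl("<a href=http://www.dotnetcoders.com>DotnetCoders.com Website</a>") == null);
    }
}
```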
We still have the case of URLs not enclosed in double quotes. Again, rather than indicate all acceptable characters, we leverage the fact that a whitespace character indicates the end of the URL. The pattern for this looks similar to the previous URL pattern, except that here the URL is a sequence of non-whitespace characters, followed by the space that ends it. We also give this pattern a group name so we can extract the URL:
(?<url>[^\s]* )
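The sketch below tries the unquoted pattern, prefixed with href\s*=\s*, on two inputs. It illustrates two things worth knowing: the literal space sits inside the url group, so it is part of the captured value, and when the closing > comes directly after the URL (as in the third sample link) the match runs on until the next space. The second regex, StopAtBracket, is a hypothetical variant and an assumption of this sketch, not part of the original article. The target=_blank attribute and all class and method names are likewise made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnquotedHrefDemo
{
    // The unquoted pattern as written in the article, with the href prefix.
    public static readonly Regex AsWritten =
        new Regex(@"href\s*=\s*(?<url>[^\s]* )");

    // Hypothetical variant (not from the article): stop at whitespace
    // or at the closing ">" of the tag, and require at least one character.
    public static readonly Regex StopAtBracket =
        new Regex(@"href\s*=\s*(?<url>[^\s>]+)");

    // Returns the captured url group, or null when the pattern does not match.
    public static string Capture(Regex pattern, string html)
    {
        Match m = pattern.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }

    public static void Main()
    {
        string spaced = "<a href=http://www.dotnetcoders.com target=_blank>Site</a>";
        string sample3 = "<a href=http://www.dotnetcoders.com>DotnetCoders.com Website</a>";

        foreach (string html in new[] { spaced, sample3 })
        {
            // Brackets make a trailing space in the capture visible.
            Console.WriteLine("as written: [{0}]", Capture(AsWritten, html));
            Console.WriteLine("variant:    [{0}]", Capture(StopAtBracket, html));
        }
    }
}
```

For the third sample link, the pattern as written captures "http://www.dotnetcoders.com>DotnetCoders.com " including the trailing space, while the variant captures only the URL. Whether the variant is the right choice depends on which characters the links on your site actually contain.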
Combining the Patterns
Because we want to include both cases, we use the OR specifier (|) to indicate that the URL will either be enclosed in quotes, or it will not. We need a non-capturing group, (?: ... ), to encompass our OR clause. The .NET regex engine allows two groups to share a name, so whichever alternative matches stores its capture in the same url group. Our final regular expression looks like this:
href\s*=\s*(?:(?:\"(?<url>[^\"]*)\")|(?<url>[^\s]* ))
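The following sketch shows the shared group name in action: whichever alternative matches, the value is read from the same Groups["url"] entry. It uses the unquoted alternative in the form that appears in the test code, [^\s]* followed by a space, and trims the captured value because that space is inside the group. The class and method names and the target=_blank attribute are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombinedHrefDemo
{
    // Quoted alternative first, then the unquoted one; both feed "url".
    public static readonly Regex Href = new Regex(
        @"href\s*=\s*(?:(?:""(?<url>[^""]*)"")|(?<url>[^\s]* ))");

    // Returns the first href value found, or null when nothing matches.
    public static string ExtractUrl(string html)
    {
        Match m = Href.Match(html);
        // Trim() removes the space captured by the unquoted alternative.
        return m.Success ? m.Groups["url"].Value.Trim() : null;
    }

    public static void Main()
    {
        // Matched by the quoted alternative.
        Console.WriteLine(ExtractUrl("<a href = \"http://www.dotnetcoders.com\">Site</a>"));
        // Matched by the unquoted alternative; a space ends the URL here.
        Console.WriteLine(ExtractUrl("<a href=http://www.dotnetcoders.com target=_blank>Site</a>"));
    }
}
```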

Testing the Pattern
The following sample code downloads the W3C homepage and parses it for links.

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class TestHarness
{
    public static void Main()
    {
        // create a request to the url
        HttpWebRequest request = (HttpWebRequest) WebRequest.Create("http://www.w3c.org/");

        string strContents;

        // get the response, then read its stream of data into a string;
        // the using blocks close the response and the reader when done
        using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            strContents = reader.ReadToEnd();
        }

        Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]* ))");
        Console.WriteLine(r.ToString());

        foreach (Match m1 in r.Matches(strContents))
        {
            // Output details of Match
            Console.WriteLine("Match: {0}", m1.Value);

            // Output only the named url group; looping over every group
            // would also print group 0, which is the whole match.
            // Trim() drops the trailing space that the unquoted alternative captures.
            Console.WriteLine("URL: {0}", m1.Groups["url"].Value.Trim());
        }
    }
}
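Since the scenario in the introduction starts from text files loaded through System.IO, it may help to see the same extraction without the network call. The sketch below reads from any TextReader, so a StreamReader over a file or a StringReader over a string both work. The names OfflineHarness and ExtractUrls, and the /two path in the sample HTML, are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class OfflineHarness
{
    // Same regular expression as in the test harness above.
    static readonly Regex Href = new Regex(
        "href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]* ))");

    // Reads everything from the reader and returns every href value found,
    // in document order.
    public static List<string> ExtractUrls(TextReader reader)
    {
        var urls = new List<string>();
        string contents = reader.ReadToEnd();
        foreach (Match m in Href.Matches(contents))
        {
            // Trim() removes the space captured by the unquoted alternative.
            urls.Add(m.Groups["url"].Value.Trim());
        }
        return urls;
    }

    public static void Main()
    {
        string html =
            "<a href=\"http://www.dotnetcoders.com\">One</a>\n" +
            "<a href = \"http://www.dotnetcoders.com/two\">Two</a>";

        using (var reader = new StringReader(html))
        {
            foreach (string url in ExtractUrls(reader))
            {
                Console.WriteLine(url);
            }
        }
    }
}
```

Returning a list rather than printing directly keeps the extraction step separate from whatever consumes the URLs, such as the validation engine mentioned at the start.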