What are Regular Expressions?
Regular expressions are a pattern-matching technology for processing text. They
have existed in the UNIX world for years and have now been incorporated into the
.NET Base Class Library. A regular expression is a string that represents a
pattern, encoded using the regular expression language and syntax. Using a
regular expression, you can search HTML, log files, documents, or any other
string source for substrings that match the pattern, and then extract or edit
them.
Although no single standard governs every regular expression dialect,
Perl 5, by virtue of its popularity, has set the de facto standard for regular
expression syntax. The .NET Framework regular expressions library is designed
to be compatible with Perl 5 regular expressions, and it includes
additional features not found elsewhere.
Example
To give you a taste of how regular expressions work, let us look at an example.
I once had a professor who proclaimed that history was summed up by the isms
(words ending in i-s-m, such as existentialism). Suppose you were given a
document and asked to extract the isms mentioned in it. Our sample contains the
following passage:
Buddhism, Confucianism, and Taoism form the basis of Chinese philosophy, and
are as central to the culture as Individualism is to the United States.
How would you extract the isms? You could manually proceed line by line,
without the help of a computer, and record the isms you see. That would
work, but it would take a while, and after all, you are a programmer. You could
also write a string parsing utility to separate the document into words, and
then see which words ended in ism. That too would work, but would require more
effort. With regular expressions, you can extract the isms with this pattern:
\w*ism
This pattern says to look for a series of zero or more word characters (\w*:
letters, digits, or underscores) followed by ism. Running this pattern against
the above passage produces the following matches:
Buddhism
Confucianism
Taoism
Individualism
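To make the example concrete, here is a minimal C# sketch of how the pattern
might be run with the System.Text.RegularExpressions classes. The IsmExtractor
class and its Extract method are hypothetical names introduced only for this
illustration; Regex.Matches and Match.Value are the library calls doing the
work.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the .NET library.
public static class IsmExtractor
{
    // Returns every substring of text that matches the pattern \w*ism.
    public static List<string> Extract(string text)
    {
        var results = new List<string>();

        // The @ prefix makes this a verbatim string, so the backslash
        // reaches the regex engine without extra escaping.
        foreach (Match m in Regex.Matches(text, @"\w*ism"))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        string passage =
            "Buddhism, Confucianism, and Taoism form the basis of Chinese philosophy, and " +
            "are as central to the culture as Individualism is to the United States.";

        // Print each match on its own line.
        foreach (string ism in Extract(passage))
        {
            Console.WriteLine(ism);
        }
    }
}
```

Note that \w*ism will also match inside a longer word (for example, the
"prism" in "prisms"). If only whole words ending in ism are wanted, one option
is to add word-boundary anchors, as in \b\w*ism\b.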
Regular expressions can perform more complex searches. In this guide to
regular expressions, we'll cover the syntax of regular expressions, the classes
in the System.Text.RegularExpressions namespace, and how to use them. You can
try out patterns with our online RegEx Tester. We'll finish off our discussion
with some examples, such as how to extract information from web pages.