Rambling Intro to Regular Expressions for VBers
Through most of the nineties I programmed in
Visual Basic. I liked VB a lot,
but I couldn't help noticing an awful lot of Perl books at my
local bookstore. Just what was the deal with that? I was interested
in finding out, but I was busy learning about SQL and DAO, no
the ODBC API, no wait RDO, no no it's ODBC direct, but wait Jet 3.5
is not nearly as bad...no forget about that it's ADO! Wait now
I need to learn ADO+. Yes, yes ADO+ is the ticket. Yes.
Now where was I? Oh yeah, I was talking about
learning things, or actually talking about not learning things because of
MS's data object model du jour.
In November 1998, I attended a day-long seminar on XML given by
Dr. Steve DeRose, editor of XPath, XPointer and XLink. Dr. DeRose repeatedly
made off-hand remarks about regular expressions, and it became clear
that many of the people who created XML used Perl because
regular expressions are an essential part of that language. (Of course,
Perl comes from a long line of regex-endowed tools, including Lex, the
lexer generator usually paired with YACC.)
Being from the VB world I had never used them. If I was to become
a true XML geek, I needed to get with the program! I went out
and got the book Mastering Regular Expressions. I was
alternately attracted and repelled by what I read: attracted to
regular expressions and repelled by the rest of Perl. But I
considered making the jump to it! Not too long after, regexes became
available in VBScript 5.0, which put off my jump to Perl. And now
with VBScript 5.5, we have access to much of Perl 5's regex syntax.
And while it is true that regexes are not as integrated as they are
in Perl, VB, now that it has regexes (and with Visual Basic.NET
on its way), is going to be a strong hand to play for a long
time.
Windows Script 5.5 (which includes JScript and VBScript) is
available at
http://msdn.microsoft.com/scripting/vbScript/download/vbsdown.htm
Be sure to download the documentation, although its coverage of regular expressions isn't
particularly great.
The Four Modes of Usage
A stumbling block to understanding regexes can be the fact that
there are four modes of use. They are:
- validation (similar to VB's Like operator)
- searching for a pattern (similar to VB's InStr function)
- global searching for all matches (nothing like this in VB)
- global replacement (similar to VB's Replace function)
Very quickly (this isn't a tutorial):
For validation, Global is set to False, ^ and $ enclose the pattern, and the Test method is used, which returns a Boolean.
For searching, Global is set to False and the Execute method is invoked to return a MatchCollection, which will be empty or have one match.
For global searching, Global is set to True and the Execute method is invoked to return a MatchCollection, which will be empty or have one or more matches.
For global replacement, Global is set to True and the Replace method is used. Replace returns a new string with the replacements made; the string passed to it is not changed. Replace takes a second argument, which is the text that will replace the matched text. It is in this second argument that you can use the dollar sign replacement variables ($1, $2 and so on).
(I'll be adding more VB specifics soon.)
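The four modes can be sketched side by side. This sketch uses Python's re module rather than the VBScript RegExp object, so the function names differ from Test, Execute and Replace; the phone-number pattern and the sample text are made up purely for illustration.

```python
import re

text = "Call 555-1234 or 555-9876 today"
phone = r"\d{3}-\d{4}"  # hypothetical pattern, chosen only for illustration

# 1. Validation: enclose the pattern in ^ and $ and ask for a yes/no answer
#    (the role Test plays with Global set to False).
is_valid = re.search("^" + phone + "$", "555-1234") is not None
is_invalid = re.search("^" + phone + "$", text) is not None

# 2. Searching: stop at the first match, or get nothing back.
first = re.search(phone, text)
first_text = first.group(0) if first else None

# 3. Global searching: collect every match.
all_matches = re.findall(phone, text)

# 4. Global replacement: a new string comes back, the input is untouched.
#    The first captured group is written \1 here; VBScript writes it $1.
masked = re.sub(r"(\d{3})-\d{4}", r"\1-XXXX", text)

print(is_valid, is_invalid, first_text, all_matches, masked)
```

Note that the replacement step hands back a new string and leaves text as it was.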
RegExTools Homework
Fire up RegExTools, set Global to on and Ignore Case
to false. Write out a paragraph like the one below for "all" of your names.
Then create a regex that matches all of them! Include a
few near misses that you don't want matched.
My full name is Thomas Markert Bosley,
but I go by Mark, although
my Aunt calls me Marky (not Markie!). Markert is a German surname. In high
school, I was addressed as "you" or "hey you", but I don't
want to remember that and so I don't want to match that. Most
junk mailers call me Thomas M Bosley or T M Bosley.
While coming up with this, at one point I registered 229 matches! Most of them
were blank. I had fallen into a very common trap: I had
marked every item as optional with ?, and so the regex happily matched the empty string! In
the final answer the parenthesized group (M...) has no ? after it, so it must
occur once in every match. The \b is also important; without it, M was
matching the M in Most and Markie. Notice also how I had to nest
(M(ark(ert|y)?)?). Without proper nesting I could have ended up matching Mert.
My solution:
((Thomas|T)\s)?(M(ark(ert|y)?)?)\b(\sBosley)?
I have never been addressed as Thomas Marky,
but the combination will match. Perhaps with a negative look-ahead
we could solve that. But, we'll save that for some other assignment.
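If you don't have RegExTools handy, the homework can be checked with a short script. This is a sketch in Python rather than VBScript. The solution pattern is copied from above, but the shortened paragraph and the two "broken" variants (all_optional and flat) are my own guesses at the kind of mistakes described, not the author's actual intermediate attempts.

```python
import re

# The solution pattern from the text, case-sensitive (Ignore Case off).
solution = re.compile(r"((Thomas|T)\s)?(M(ark(ert|y)?)?)\b(\sBosley)?")

paragraph = (
    "My full name is Thomas Markert Bosley, but I go by Mark, although "
    "my Aunt calls me Marky (not Markie!). Markert is a German surname. "
    "Most junk mailers call me Thomas M Bosley or T M Bosley."
)

# Global search: every match, in order of appearance.
matches = [m.group(0) for m in solution.finditer(paragraph)]

# Hypothetical variant with every item optional: it can match nothing,
# so a global search reports blank matches all over the place.
all_optional = re.compile(r"((Thomas|T)\s)?(M(ark(ert|y)?)?)?(\sBosley)?")
blank_hits = [m.group(0) for m in all_optional.finditer("Hi Mark")]

# Hypothetical variant without nesting: "ert" no longer requires "ark".
flat = re.compile(r"M(ark)?(ert|y)?\b")
flat_accepts_mert = flat.fullmatch("Mert") is not None
nested_accepts_mert = solution.fullmatch("Mert") is not None

# The known loose end: this combination is accepted too.
thomas_marky = solution.fullmatch("Thomas Marky") is not None

print(matches)
print(blank_hits)
```

The near misses My, Most and Markie drop out because \b refuses a match when a word character follows directly.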
XPath and Regexes
It is worth a minute or so to compare XPath
and regexes. They both allow you to declaratively
select a particular item or group of items. For XPath, the selected items are returned
as a node-set; for regexes, they are (with the MS implementation) returned in
the MatchCollection.
One fundamental difference between the
two is that XPath works on hierarchies, which have depth and breadth, while regexes
work on a string (which can be thought of as a
one-dimensional object). Thus, regexes have nothing like XPath's
//, .., or /, which step up or
down the hierarchy. XPath also allows you to search
for the name of an element and/or the value of an element or
attribute. With regexes you are dealing with a simple string, so the
concept of a name does not come into play.
One syntax difference you need to understand is the bar
|. In regexes the | marks alternatives: when a match is made, the other
alternatives are not tried. In XPath the bar is a union: when
a match is made, it is added to the node-set, and then the other
expressions are tested as well and their matches added accordingly.
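The difference between the two bars can be sketched in Python. The regex half is direct. The XPath half is only an emulation: the standard library's ElementTree supports a limited subset of XPath, so the union is imitated here by running two path queries and joining the results. The sample XML is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Regex bar: alternatives are tried left to right and the first one that
# succeeds wins, even though the second would have matched more text.
first_alt = re.search(r"Mark|Markert", "Markert Bosley").group(0)

# XPath-style union, emulated: both expressions are evaluated and both
# sets of results end up in the answer.
doc = ET.fromstring(
    "<names><first>Thomas</first><middle>Markert</middle>"
    "<last>Bosley</last></names>"
)
union = doc.findall("first") + doc.findall("last")
union_text = [element.text for element in union]

print(first_alt, union_text)
```

A real XPath union would also remove duplicates and return nodes in document order, which plain list concatenation does not do.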
Regular Expression Syntax: it's so ugly it's cute!
When I was a boy, our family had a bulldog.
Everybody commented on what an ugly dog he was, but I knew that they were just
jealous. True, our dog Duke was ugly on one level (the outward appearance
level), but in another way he was the cutest dog around.
Regular Expressions have a similar appeal. Looking at something like
"[^"\\]*(\\.[^"\\]*)*"
probably brings back memories of the time you tried to
hack an executable with a hex editor. Regexes certainly don't look as if
they provide a higher level of abstraction.
But regexes do indeed
provide a higher level of abstraction if you can get beyond the syntax. And one great
way to get over this hurdle is to download RegExTools, fire it up
and start typing away.
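To take some of the mystery out of the ugly example above: as I read it, it matches a double-quoted string that may contain backslash-escaped characters, including escaped quotes. Here is a small sketch in Python (not VBScript) that tries it on a made-up sentence.

```python
import re

# Opening quote, a run of ordinary characters, then any number of
# "backslash + any character + more ordinary characters", closing quote.
quoted = re.compile(r'"[^"\\]*(\\.[^"\\]*)*"')

sample = r'He said "hello" and then "a \"quoted\" word".'

found = [m.group(0) for m in quoted.finditer(sample)]
print(found)
```

The escaped quotes inside the second string do not end the match early, which is the whole point of the \\. part.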
I'll just point out a few things about regular expression syntax.
First, some symbols do double duty. ^
marks the beginning of a line, but as the first character inside a character class it
"negates", so the pattern ^[^a] matches the beginning of a line followed by something other than an "a". The dot (that is, the
period) stands for any character, but inside a character class it stands for a literal period. The backslash also does double duty.
It escapes characters that are a part of regex syntax (meta-characters), so \^
matches the literal ^. It also, in a reverse
manner, turns certain letters into predefined character classes and
assertions (\d marks the digit character class).
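Each of these double duties can be seen in a line or two. The sketch below uses Python's re module rather than VBScript, and the test strings are made up for illustration.

```python
import re

# ^ outside a class anchors; ^ at the start of a class negates.
caret_anchor = re.search(r"^[^a]", "banana")   # line starts with a non-"a"
caret_reject = re.search(r"^[^a]", "apple")    # line starts with "a"

# The dot is "any character" outside a class, a plain period inside one.
dot_any = re.findall(r"a.c", "abc a.c a-c")
dot_literal = re.findall(r"a[.]c", "abc a.c a-c")

# The backslash strips the special meaning from ^ ...
literal_caret = re.search(r"\^", "2^8")
# ... and gives a special meaning to the letter d.
digits = re.findall(r"\d", "a1b22")

print(caret_anchor.group(0), caret_reject, dot_any, dot_literal,
      literal_caret.group(0), digits)
```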
Second, there are
many different symbols or sequences that are exactly equivalent.
\d is the same as [0-9],
\w is the same as [A-Za-z0-9_], {0,} is the same as
*, ? is the same as
{0,1} etc. So, \d\d\d-?\d\d\d\d is the same
as [0-9]{3}-{0,1}[0-9]{4}. Don't get thrown off by this!
Third,
predefined character classes are negated by making them upper case. So,
\d matches 0-9, \D matches
everything else.
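The equivalences and the upper-case negation can be checked mechanically. This sketch is in Python, not VBScript. Python's \d and \w also accept non-ASCII digits and letters by default, so the re.ASCII flag is used to get the ASCII-only behavior described here. The helper name members and the phone-number samples are my own.

```python
import re

ascii_chars = [chr(i) for i in range(128)]

def members(pattern):
    """Return the ASCII characters matched by a one-character pattern."""
    return [c for c in ascii_chars if re.fullmatch(pattern, c, re.ASCII)]

# Shorthand classes versus their spelled-out forms.
digit_same = members(r"\d") == members(r"[0-9]")
word_same = members(r"\w") == members(r"[A-Za-z0-9_]")

# The range A-z is NOT a safe way to write "all letters": it also takes
# in the punctuation that sits between Z and a in ASCII.
sloppy_extra = sorted(set(members(r"[A-z0-9_]")) - set(members(r"\w")))

# Upper case negates: \d and \D split the characters between them.
no_overlap = set(members(r"\d")).isdisjoint(members(r"\D"))
covers_all = sorted(members(r"\d") + members(r"\D")) == sorted(ascii_chars)

# Quantifier equivalence: ? is {0,1}, and repeated \d is [0-9]{n}.
short = re.compile(r"\d\d\d-?\d\d\d\d", re.ASCII)
long_form = re.compile(r"[0-9]{3}-{0,1}[0-9]{4}")
samples = ["555-1234", "5551234", "55-1234", "555--1234", "call 555-1234 now"]
short_hits = [bool(short.search(s)) for s in samples]
long_hits = [bool(long_form.search(s)) for s in samples]

print(digit_same, word_same, sloppy_extra, no_overlap, covers_all)
print(short_hits, long_hits)
```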
Fourth, parentheses have four distinct uses:
- Most obviously, they, along with a bar, delineate the extent of an alternation.
- They allow a sequence of characters to be operated on by a quantifier. So with Mark+ the plus operates on the k, while the plus in (Mark)+ operates on Mark.
- You use parentheses to mark off subpatterns.
- Lastly, parentheses "memorize" a match for use with backreferences.
These last two
uses have a performance cost, so Larry Wall added
syntax in Perl 5 to create "non-capturing" parentheses, written (?:...). (This is available in the VBScript
5.5 regex engine.)
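Here is a small sketch of the four uses, plus the non-capturing form, written in Python rather than VBScript. All the sample strings and the gr(a|e)y and doubled-word patterns are made up for illustration.

```python
import re

# 1. Marking the extent of an alternation: only the a/e choice varies.
alt = [bool(re.fullmatch(r"gr(a|e)y", w)) for w in ("gray", "grey", "groy")]

# 2. Giving a quantifier a whole sequence to work on.
plus_on_k = bool(re.fullmatch(r"Mark+", "Markkk"))
plus_on_k_only = bool(re.fullmatch(r"Mark+", "MarkMark"))
plus_on_group = bool(re.fullmatch(r"(Mark)+", "MarkMark"))

# 3. Marking off subpatterns so each piece can be pulled out afterwards.
parts = re.search(r"(\d{3})-(\d{4})", "555-1234").groups()

# 4. Remembering a match for a backreference: \1 repeats group 1.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine").group(1)

# Non-capturing parentheses still group, but nothing is remembered.
nc = re.search(r"(?:Mark)+", "MarkMark")
nc_text, nc_groups = nc.group(0), nc.groups()

print(alt, plus_on_k, plus_on_k_only, plus_on_group, parts, doubled,
      nc_text, nc_groups)
```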
So in conclusion, (?:^[^b]\b) is a
non-capturing parenthesis that matches, at the beginning of a line, any single character except
b, followed by a word boundary. My, how cute!
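For the skeptical, that closing pattern can be tried out directly. This is a Python sketch with made-up test strings, assuming the pattern behaves the same way as in the VBScript engine.

```python
import re

pattern = re.compile(r"(?:^[^b]\b)")

hit = pattern.search("a cat")          # "a" at the start, then a boundary
miss_b = pattern.search("b cat")       # starts with the excluded "b"
miss_word = pattern.search("at last")  # "a" runs straight into "t"

print(hit.group(0), hit.groups(), miss_b, miss_word)
```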