
Regular Expressions Page

Regular expressions, the powerful string pattern matching and manipulation tool, are finally available to VB programmers. This powerful mini-language can be intimidating at first, but you can start out simply and attempt more as your knowledge grows. Whether you are experienced or brand new, be sure to visit http://www.topxml.com/regextools/

Please note that I sometimes mention advanced regular expression features in an example without explaining what they are. When I do this, knowing exactly what these features do is not important to the main point.

Rambling Intro to Regular Expressions for VBers

Through most of the nineties I programmed in Visual Basic. I liked VB a lot, but I couldn't help noticing an awful lot of Perl books at my local bookstore. Just what was the deal with that? I was interested in finding out, but I was busy learning about SQL and DAO, no, the ODBC API, no wait, RDO, no no, it's ODBCDirect, but wait, Jet 3.5 is not nearly as bad... no, forget about that, it's ADO! Wait, now I need to learn ADO+. Yes, yes, ADO+ is the ticket. Yes.

Now where was I? Oh yeah, I was talking about learning things, or actually talking about not learning things because of MS's data object model du jour.

In November 1998, I attended a day-long seminar on XML given by Dr. Steve DeRose, editor of XPath, XPointer and XLink. Dr. DeRose repeatedly made off-hand remarks about regular expressions, and it became clear that many of the people who created XML used Perl because regular expressions are an essential part of that language. (Of course, Perl comes from a long line of regex-endowed languages including Lex and YACC.)

Being from the VB world, I had never used them. If I was to become a true XML geek, I needed to get with the program! I went out and got the book Mastering Regular Expressions. I was alternately attracted and repelled by what I read: attracted to regular expressions and repelled by the rest of Perl. But I considered making the jump to it! Not too long after, regexes became available in VBScript 5.0, which put off my jump to Perl. And now with VBScript 5.5, we have access to much of Perl 5's regex syntax. While it is true that regexes are not as integrated as they are in Perl, VB, now that it has regexes (and with Visual Basic.NET on its way), is going to be a strong hand to play for a long time.

Windows Script 5.5 (which is JScript and VBScript) is available at http://msdn.microsoft.com/scripting/vbScript/download/vbsdown.htm

Be sure to download the documentation, although its coverage of regular expressions isn't particularly great.

The Four modes of usage

A stumbling block to understanding regexes can be the fact that there are four modes of use. They are:
  • validation (similar to VB's Like operator)
  • searching for a pattern (similar to VB's InStr function)
  • global searching for all matches (nothing like this in VB)
  • global replacement (similar to VB's Replace function)

Very quickly (this isn't a tutorial):

For validation, Global is set to false, ^ and $ enclose the pattern, and the Test method is used, which returns a Boolean.

For searching, Global is set to false and the Execute method is invoked to return a MatchCollection, which will be empty or have one match.

For global searching, Global is set to true and the Execute method is invoked to return a MatchCollection, which will be empty or have one or more matches.

For global replacement, Global is set to true and the Replace method is used. Replace returns a new string with the replacements made; the string that is passed to it is left unchanged. Replace takes a second argument, which is the text that will replace the matched text. It is in this second argument that you can use the dollar sign replacement variables ($1, $2 and so on).
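As a rough illustration of the four modes, here is a minimal sketch in Python's re module rather than VBScript. The sample text and the phone-number pattern are made up for the example, and the mapping to Test, Execute and Replace in the comments is an approximation, not a description of the VBScript object itself.

```python
import re

text = "Call 555-1234 or 555-9876 today"   # hypothetical sample text
phone = r"\d{3}-\d{4}"                     # hypothetical pattern

# 1. Validation: wrap the pattern in ^ and $ and ask for a yes/no answer
#    (roughly VBScript's Test with Global = False).
is_phone = re.search("^" + phone + "$", "555-1234") is not None
not_phone = re.search("^" + phone + "$", text) is not None

# 2. Searching: only the first match
#    (roughly Execute with Global = False).
first = re.search(phone, text)
first_value = first.group(0) if first else None

# 3. Global searching: every match
#    (roughly Execute with Global = True).
all_values = [m.group(0) for m in re.finditer(phone, text)]

# 4. Global replacement: re.sub returns a new string and leaves text as it was.
#    Python spells the replacement variable \1 where VBScript spells it $1.
masked = re.sub(r"(\d{3})-\d{4}", r"\1-XXXX", text)

print(is_phone, not_phone, first_value, all_values, masked)
```

The same pattern serves all four modes; only the anchoring, the global setting and the method change.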

(I'll be adding more VB Specifics soon)

RegExTools Homework

Fire up RegexTools, set Global to on and Ignore Case to off. Write out a paragraph like the one below covering "all" of your names, and then create a regex that matches all of them! Include a few near misses that you don't want matched.

My full name is Thomas Markert Bosley, but I go by Mark, although my Aunt calls me Marky(not Markie!). Markert is a German surname. In high school, I was addressed as "you" or "hey you", but I don't want to remember that and so I don't want to match that. Most junk mailers call me Thomas M Bosley or T M Bosley

While coming up with this, at one point I registered 229 matches, most of them blank! I had fallen into a very common trap: I had marked every item as optional with ?, so the regex could match nothing at all, and it did so at nearly every position. In the final answer the parenthesized group (M...) has no ? after it, so it must occur once in every match. The \b is also important; without it, M was matching the M in Most and Markie. Notice also how I had to nest (M(ark(ert|y)?)?). Without proper nesting I could have ended up matching Mert.

My solution:  ((Thomas|T)\s)?(M(ark(ert|y)?)?)\b(\sBosley)?
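If you don't have RegexTools handy, the homework can be checked with a short script. This is a sketch in Python rather than VBScript; the all_optional pattern is a guess at what the trap might have looked like, not the pattern actually used at the time.

```python
import re

paragraph = (
    'My full name is Thomas Markert Bosley, but I go by Mark, although my '
    'Aunt calls me Marky(not Markie!). Markert is a German surname. In high '
    'school, I was addressed as "you" or "hey you", but I don\'t want to '
    'remember that and so I don\'t want to match that. Most junk mailers '
    'call me Thomas M Bosley or T M Bosley'
)

solution = r"((Thomas|T)\s)?(M(ark(ert|y)?)?)\b(\sBosley)?"
matches = [m.group(0) for m in re.finditer(solution, paragraph)]

# Near misses: neither should match anywhere.
markie = re.search(solution, "Markie")
mert = re.search(solution, "Mert")

# Hypothetical version of the trap: every item optional and no \b,
# so the pattern is happy to match the empty string.
all_optional = r"((Thomas|T)\s)?(M(ark(ert|y)?)?)?(\sBosley)?"
trap = [m.group(0) for m in re.finditer(all_optional, paragraph)]
empty_count = sum(1 for s in trap if s == "")

for found in matches:
    print(repr(found))
print("trap matches:", len(trap), "of which empty:", empty_count)
```

The first loop prints the six name forms from the paragraph; the last line shows how quickly the blank matches pile up once nothing is required.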

I have never been addressed as Thomas Marky, but the combination will match. Perhaps a negative look-ahead could solve that, but we'll save that for some other assignment.

XPath and Regexes

It is worth a minute or so to compare XPath and regexes. They both allow you to declaratively select a particular item or group of items. For XPath, the selected items are returned as a node-set; for regexes, they are returned (with the MS implementation) in the MatchCollection.

One fundamental difference between the two is that XPath works on hierarchies, which have depth and breadth, while re's work on a string (which can be thought of as a one-dimensional object). Thus, re's have nothing like XPath's //, .., or /, which step up or down the hierarchy. XPath also allows you to search for the name of an element and/or the value of an element or attribute. With re's you are dealing with a simple string, so the concept of a name does not come into play.

One syntax difference you need to understand is the bar |. In re's, the | marks alternatives: when a match is made, the other alternatives are not tried. In XPath, the bar is a union: when a match is made, it is added to the node-set, and then the other expressions are also tested and their matches added accordingly.
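The regex half of that difference is easy to see in a couple of lines. This is a small sketch in Python (a backtracking engine like the VBScript one); the names in the patterns are just borrowed from the homework above.

```python
import re

# Alternatives are tried left to right. Once one of them lets the whole
# pattern succeed, the remaining alternatives are never tried.
short_first = re.search(r"Mark|Markert", "Markert").group(0)
long_first = re.search(r"Markert|Mark", "Markert").group(0)

print(short_first)  # the shorter alternative won because it came first
print(long_first)   # reordering the alternatives changes the result
```

So the order of the alternatives matters in a regex, which is not the case for an XPath union.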

 

Regular Expression Syntax: it's so ugly it's cute!

When I was a boy, our family had a bulldog. Everybody commented on what an ugly dog he was, but I knew that they were just jealous. True, our dog Duke was ugly on one level (the outward appearance level), but in another way he was the cutest dog around.

Regular Expressions have a similar appeal. Looking at something like "[^"\\]*(\\.[^"\\]*)*" probably brings back memories of the time you tried to hack an executable with a hex editor. Regex's certainly don't look as if they provide a higher level of abstraction.

But regexes do indeed provide a higher level of abstraction if you can get beyond the syntax. And one great way to get over this hurdle is to download RegExTools, fire it up and start typing away.

I'll just point out a few things about regular expression syntax.

First, some symbols do double duty. ^ marks the beginning of a line, but inside a character class it "negates", so the pattern ^[^a] matches a line that begins with something other than an "a". The dot (that is, the period) stands for any character, but inside a character class it stands for a literal period. The backslash also does double duty. It escapes characters that are part of regex syntax (meta-characters), so \^ matches the literal ^. It also, in a reverse manner, turns certain letters into predefined character classes and assertions (\d marks the digit character class).
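A quick way to convince yourself of the double duty is to try each symbol in both roles. The sketch below uses Python's re module; the test strings are made up.

```python
import re

# ^ anchors outside a character class and negates inside one.
starts_non_a = re.search(r"^[^a]", "banana")   # first character is not "a"
starts_a = re.search(r"^[^a]", "apple")        # first character is "a"

# The dot is "any character" outside a class, a literal period inside one.
any_char = re.findall(r"a.c", "abc a.c")
literal_dot = re.findall(r"a[.]c", "abc a.c")

# The backslash strips the special meaning from ^ ...
carets = re.findall(r"\^", "2^10")
# ... and gives the plain letter d a special meaning.
digits = re.findall(r"\d", "2^10")

print(starts_non_a.group(0), starts_a, any_char, literal_dot, carets, digits)
```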

Second, there are many different symbols or sequences that are exactly equivalent. \d is the same as [0-9], \w is the same as [A-Za-z0-9_], {0,} is the same as *, ? is the same as {0,1}, etc. So, \d\d\d-?\d\d\d\d is the same as [0-9]{3}-{0,1}[0-9]{4}. Don't get thrown off by this!

Third, the predefined character classes are negated by making the letter upper case. So, \d matches 0-9, and \D matches everything else.
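Equivalences like these can be checked by running both spellings over the same input. This is a sketch in Python with made-up sample strings. One assumption to flag: in Python 3, \d also matches non-ASCII digits unless the re.ASCII flag is given, so the flag is used here to keep \d equal to [0-9].

```python
import re

samples = ["555-1234", "5551234", "555--1234", "55-51234"]

# Shorthand spelling versus the spelled-out spelling of the same pattern.
shorthand = re.compile(r"^\d\d\d-?\d\d\d\d$", re.ASCII)
spelled_out = re.compile(r"^[0-9]{3}-{0,1}[0-9]{4}$")

shorthand_results = [bool(shorthand.search(s)) for s in samples]
spelled_out_results = [bool(spelled_out.search(s)) for s in samples]

# * and {0,} are the same quantifier.
star = re.findall(r"ab*", "a ab abb")
braces = re.findall(r"ab{0,}", "a ab abb")

# Upper case negates: \D is everything that \d is not.
non_digits = re.findall(r"\D", "a1b2", re.ASCII)

print(shorthand_results, spelled_out_results, star, braces, non_digits)
```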

Fourth, parentheses have four distinct uses:
  • Most obviously, they, along with a bar, delineate the extent of an alternation.
  • They allow a sequence of characters to be operated on by a quantifier. So with Mark+ the plus operates on the k, while the plus in (Mark)+ operates on Mark.
  • They mark off subpatterns.
  • Lastly, they "memorize" a match for use with backreferences.

These last two uses have a performance cost, so Larry Wall added syntax in Perl 5 to create "non-capturing" parentheses, written (?:...). (This is available in the VBScript 5.5 regex engine.)

So in conclusion, (?:^[^b]\b) is a non-capturing group that matches, at the beginning of a line, a single character other than b that is followed by a word boundary. My, how cute!
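The uses of parentheses described above can each be tried in a line or two. The following is a sketch in Python, not VBScript, and the sample strings are invented for the purpose.

```python
import re

# Quantifier scope: Mark+ repeats only the k, (Mark)+ repeats the whole word.
plus_on_k = re.fullmatch(r"Mark+", "Markkk") is not None
plus_on_k_only = re.fullmatch(r"Mark+", "MarkMark") is not None
plus_on_group = re.fullmatch(r"(Mark)+", "MarkMark") is not None

# Subpatterns: each pair of parentheses remembers what it matched.
m = re.search(r"(\w+)@(\w+)\.com", "mark@lightcc.com")
user, domain = m.group(1), m.group(2)

# Backreference: \1 must repeat whatever group 1 matched (a doubled word).
doubled = re.search(r"\b(\w+)\s\1\b", "it was the the best").group(1)

# Non-capturing: (?:...) groups without remembering, so only (M) is captured.
captured = re.search(r"(?:Thomas|T)\s(M)", "T M Bosley").groups()

# The closing example: one non-b character at the start, then a word boundary.
cute_hit = re.search(r"(?:^[^b]\b)", "a cat")
cute_b = re.search(r"(?:^[^b]\b)", "b cat")     # starts with b
cute_word = re.search(r"(?:^[^b]\b)", "at")     # no boundary after the a

print(plus_on_k, plus_on_k_only, plus_on_group, user, domain, doubled, captured)
print(cute_hit.group(0), cute_b, cute_word)
```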

     

    Mark Bosley is a programmer based in Milwaukee, WI. You can reach him at mark@lightcc.com
     
