We eat a lot of JDBC Type 4 database driver connections at Portico Systems. One of the challenges during installations is ensuring the deployment team gets the syntax of these URL's correct. We have some decent validation of them embedded already, but recently I decided to make it bullet proof.
Since Java 1.4 Regular Expressions were baked into the Java stack. For example the Pattern class. This is a beautiful thing. (Before 1.4 libraries like Jakarta ORO were used to get the power of regex in our code.)
So I was thinking that surely out there on the inker-net somebody would have posted clever regexes to handle JDBC URL's as well as IP quad format addresses. What I discovered were some bits and pieces - some wrong, some close, and none really complete.
So here is my contribution; I hope it is useful to you. Rather than walk you through this beast I am just going to put it out there. If you dance with this pattern, I recommend the wonderful Regex Coach, a Common Lisp based (!) native Windows app. I wish I had something like it on OS X and Linux.
Back to my pain... the biggest challenge to wrestle was that an Oracle Type 4 URL could have an IP address, or a hostname embedded in it. Both of these are very tricky to handle in a regex. The pattern I put together comes really close to perfection, in fact it may just be there already but I don't want to jinx this blog. I am guessing I may have missed something, but so far all my JUnit tests are passing.
What's the big whoop bout IP Addresses in "dotted quad"? They are tricky because of the limited range of each number - 1 to 255 for each quad. And if you really wanted to get technical, there are certain reserved ranges you might want to lock out, but I just don't have that kind of time. Oh, and I am not worrying about IPv6 with this regex pattern.
And hostnames, aren't they easy? NOPE. How many people out there really know the spec? I was forced long ago, when operating my own DNS servers, to appreciate RFC 952 and RFC 882. They differ on some points, so I am not entirely clear on how it all shakes out... for example RFC 952 disallows single character names and RFC 882 allows them. But they agree on most of the high level basics.
From RFC 952:
1. A "name" (Net, Host, Gateway, or Domain name) is a text string up
to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus
sign (-), and period (.). Note that periods are only allowed when
they serve to delimit components of "domain style names". (See
RFC-921, "Domain Name System Implementation Schedule", for
background). No blank or space characters are permitted as part of a
name. No distinction is made between upper and lower case. The first
character must be an alpha character. The last character must not be
a minus sign or period. (...) Single character names
or nicknames are not allowed.
Hmm... I have to wonder if the 24 position thing is really a limit anymore, and aren't their root name servers on the internet with single position names? From RFC 882:
Appendix 1 - Domain Name Syntax Specification
The preferred syntax of domain names is given by the following BNF
rules. Adherence to this syntax will result in fewer problems with
many applications that use domain names (e.g., mail, TELNET). Note
that some applications described in  use domain names containing
binary information and hence do not follow this syntax.
<domain> ::= <subdomain> | " "
<subdomain> ::= <label> | <subdomain> "." <label>
<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
<let-dig-hyp> ::= <let-dig> | "-"
<let-dig> ::= <letter> | <digit>
<letter> ::= any one of the 52 alphabetic characters A through Z
in upper case and a through z in lower case
<digit> ::= any one of the ten digits 0 through 9
Note that while upper and lower case letters are allowed in domain
names no significance is attached to the case. That is, two names
with the same spelling but different case are to be treated as if
The labels must follow the rules for ARPANET host names. They must
start with a letter, end with a letter or digit, and have as interior
characters only letters, digits, and hyphen. There are also some
restrictions on the length. Labels must be 63 characters or less.
I have attempted to faithfully adhere to both these specs where they agree, and where they do not - I allow for 63 position as well as single position names.
Finally, note that my regex creates 3 matching groups:
- hostname or IP address
- Oracle SID
These groupings are your handles to programmatically pluck out the URL pieces you want to chew on.
Let me know if I missed something! (NOTE - my blog was having trouble with the long string. I have chopped it into five bits. You must put the following five lines into one long string to create the regex, or click here to get it in a downloaded text file.)
Once you have the previous five lines stitched together into one regex, you should be able to match the following example URL's, and break them down into the three groupings: