The Near-Perfect Email Validating Regular Expression

There have been hundreds of posts on the internet by developers claiming to have an “excellent” regular expression with which an app can validate an address supplied by an end user. However, what you get is a long, unintelligible string. While you can test it against one or more addresses, you don’t really know its limitations or what it’s going to reject.

Fortunately, some interested parties have created sites for the sole purpose of testing supplied expressions: the one I’ve most often used is here. However, as you can see there, no expression passes all the good addresses or rejects all the bad ones.

What really got me interested in this was the inconsistency of the expressions, the lack of traceability to the relevant specs, and even the opportunity to fix a known failure (as if I could even read the expression)! Then, while trying to find a URL validator, I tripped on a site created by Jeff Roberson, who constructed a complicated URL validator by using the relevant RFC, then developing a small regular expression snippet for each of the components, then finally assembling the final full-featured expression. Really impressive.

So early in 2013 I set out to develop a totally standards-compliant regular expression. The first big hurdle was the claim that the spec uses recursion, and thus no regular expression could ever meet it. Well, it turns out that only comments embedded within an email address can be nested, and whoever saw a comment in an email address!

Comments take the form “(some text)”, and nesting occurs when “some text” itself contains a comment. So, all this fuss about recursion is focused on a feature that no one uses! In the end, to ensure the regular expression I produced would pass some pretty severe existing tests, I implemented comments to a user-specified fixed level using the “or” regular expression feature. Thus, a “comment” is { “comment” | “comment nested to one level” | “comment nested to two levels” }. Note that to pass the tests you need to handle a nesting level of 5, which greatly increases the size of the regular expression, and which no real app would ever use.
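To make the fixed-depth idea concrete, here is a minimal Python sketch. It is not the code from the app: the names CTEXT and comment_pattern are made up for illustration, the comment text is simplified to printable ASCII plus spaces (no folding whitespace or quoted pairs), and each level is built by wrapping the previous one rather than by listing every level side by side.

```python
import re

# Simplified comment text: space plus printable ASCII except "(", ")" and "\".
# (An assumption for this sketch; the real spec also allows folding
# whitespace and backslash-quoted characters.)
CTEXT = r"[ !-'*-\[\]-~]"


def comment_pattern(depth: int) -> str:
    """Return a regex for a comment nested at most `depth` levels deep."""
    # Level 0: parentheses around plain text only.
    pattern = rf"\((?:{CTEXT})*\)"
    for _ in range(depth):
        # Each further level accepts plain text OR a comment of the level below.
        pattern = rf"\((?:{CTEXT}|{pattern})*\)"
    return pattern


def is_comment(text: str, depth: int) -> bool:
    """True if `text` is one complete comment within the allowed depth."""
    return re.fullmatch(comment_pattern(depth), text) is not None


if __name__ == "__main__":
    for sample in ["(hello)", "(a (b) c)", "(a (b (c)) d)", "(a (b c)"]:
        print(sample, [is_comment(sample, d) for d in range(3)])
```

Because the nesting is unrolled rather than recursive, a comment nested deeper than the chosen level is simply rejected, which is the trade-off described above.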

The second hurdle is that the test suites do not rely solely on the principal RFC (RFC 5322), but also on related RFCs, some of which contradict 5322! In the end I had to incorporate information relating to the specification of IPv6 addresses (complex!), IPv4 addresses, part lengths, and even contradictions within the core RFC (the text says one thing, the ABNF says something else).
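As an illustration of how one of these pieces can be built up from a small snippet, here is a hedged Python sketch for a bracketed IPv4 address. The octet alternatives are the same ones that appear in the final expression; the names OCTET, IPV4, IPV4_LITERAL and is_ipv4_literal are invented for this example.

```python
import re

# One octet, 0-255, with no leading zeros.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"

# Four octets separated by dots, assembled from the snippet above.
IPV4 = rf"(?:{OCTET}\.){{3}}{OCTET}"

# In an email address the IP form of the domain is wrapped in brackets.
IPV4_LITERAL = rf"\[{IPV4}\]"


def is_ipv4_literal(text: str) -> bool:
    """True if `text` is a bracketed dotted-quad such as "[192.168.0.1]"."""
    return re.fullmatch(IPV4_LITERAL, text) is not None


if __name__ == "__main__":
    for sample in ["[192.168.0.1]", "[256.1.1.1]", "[01.2.3.4]", "[1.2.3]"]:
        print(sample, is_ipv4_literal(sample))
```

Keeping each component in its own named snippet is what makes it possible to trace a failure back to one rule instead of staring at the whole expression.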

A final hurdle was dealing with “deprecated” rules: alongside the rules the spec officially recommends, there are old ones that should be supported but shouldn’t be used anyway. In the end I solved this by deciding NOT to support deprecated rules (which made my life easier).

It became obvious immediately that there is no “perfect” RFC: if the text and the ABNF contradict each other, you can have it one way or the other, but not both! The solution was to punt the decision to whoever creates the final regular expression, and let that person decide what to accept and what not to.

Another issue was what to do about regex syntax: Objective-C on the Mac uses the ICU package, whose syntax is based on Perl. Portions of that syntax are not supported by the “C” POSIX side of Macs, or by any other POSIX-derived regular expression package. The spec uses terms that better match Perl too. In the end, I was able to craft the regular expression so that it would mostly work on both, and could be tailored for one or the other by changing a few items.

The final result is a Mac app that can construct a regular expression to your specification, output it in text or string format, and interactively test it against text the user types or pastes into the app. Additionally, the app contains a class for use in validating or extracting regular expressions, which could, with some small effort, be ported to other languages. There is a C function to validate a single email address too.

[Image: screenshot of the app]

So, what does the near-perfect email validating expression look like? Like this:

"^(?:(?:(?:(?: )*(?:(?:(?:\\t| )*\\r\\n)?(?:\\t| )+))+(?: )*)|(?: )+)?(?:(?:(?:[-A-Za-z0-9!#$%&'*+/=?^_`{|}~]+(?:\\.[-A-Za-z0-9!#$%&'*+/=?^_`{|}~]+)*)|(?:\"(?:(?:(?:(?: )*(?:(?:[!#-Z^-~]|\\[|\\])|(?:\\\\(?:\\t|[ -~]))))+(?: )*)|(?: )+)\"))(?:@)(?:(?:(?:[A-Za-z0-9](?:[-A-Za-z0-9]{0,61}[A-Za-z0-9])?)(?:\\.[A-Za-z0-9](?:[-A-Za-z0-9]{0,61}[A-Za-z0-9])?)*)|(?:\\[(?:(?:(?:(?:(?:[0-9]|(?:[1-9][0-9])|(?:1[0-9][0-9])|(?:2[0-4][0-9])|(?:25[0-5]))\\.){3}(?:[0-9]|(?:[1-9][0-9])|(?:1[0-9][0-9])|(?:2[0-4][0-9])|(?:25[0-5]))))|(?:(?:(?: )*[!-Z^-~])*(?: )*)|(?:[Vv][0-9A-Fa-f]+\\.[-A-Za-z0-9._~!$&'()*+,;=:]+))\\])))(?:(?:(?:(?: )*(?:(?:(?:\\t| )*\\r\\n)?(?:\\t| )+))+(?: )*)|(?: )+)?$"

This expression is the result of selecting the “Validation” preset in the app, pasted as one string. Note that it only tests against “local-part@domain” style addresses, not “mailbox” specs (“DisplayName <local-part@domain>”). If that’s what you want, then check the appropriate box and generate a different one!
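For anyone who wants to try the expression outside the app, here is a rough Python sketch. Python is my substitution here (the app itself targets ICU and POSIX), and the names PATTERN and is_valid are made up for the example. The string-format output, with its doubled backslashes and straight quotes, happens to also be a valid Python string literal, so it is pasted unchanged and only split across lines for readability.

```python
import re

# The "Validation" preset expression in string format, split at logical
# boundaries. Adjacent string literals are concatenated by Python.
PATTERN = (
    # Optional leading whitespace / folding whitespace.
    "^(?:(?:(?:(?: )*(?:(?:(?:\\t| )*\\r\\n)?(?:\\t| )+))+(?: )*)|(?: )+)?"
    # Local part: dot-atom ...
    "(?:(?:(?:[-A-Za-z0-9!#$%&'*+/=?^_`{|}~]+(?:\\.[-A-Za-z0-9!#$%&'*+/=?^_`{|}~]+)*)"
    # ... or quoted string.
    "|(?:\"(?:(?:(?:(?: )*(?:(?:[!#-Z^-~]|\\[|\\])|(?:\\\\(?:\\t|[ -~]))))+(?: )*)|(?: )+)\"))"
    "(?:@)"
    # Domain: dotted labels ...
    "(?:(?:(?:[A-Za-z0-9](?:[-A-Za-z0-9]{0,61}[A-Za-z0-9])?)(?:\\.[A-Za-z0-9](?:[-A-Za-z0-9]{0,61}[A-Za-z0-9])?)*)"
    # ... or a bracketed literal (IPv4, general text, or a "v" tagged form).
    "|(?:\\[(?:(?:(?:(?:(?:[0-9]|(?:[1-9][0-9])|(?:1[0-9][0-9])|(?:2[0-4][0-9])|(?:25[0-5]))\\.){3}(?:[0-9]|(?:[1-9][0-9])|(?:1[0-9][0-9])|(?:2[0-4][0-9])|(?:25[0-5]))))"
    "|(?:(?:(?: )*[!-Z^-~])*(?: )*)|(?:[Vv][0-9A-Fa-f]+\\.[-A-Za-z0-9._~!$&'()*+,;=:]+))\\])))"
    # Optional trailing whitespace / folding whitespace.
    "(?:(?:(?:(?: )*(?:(?:(?:\\t| )*\\r\\n)?(?:\\t| )+))+(?: )*)|(?: )+)?$"
)

EMAIL_RE = re.compile(PATTERN)


def is_valid(address: str) -> bool:
    """True if `address` matches the local-part@domain expression."""
    return EMAIL_RE.match(address) is not None


if __name__ == "__main__":
    samples = [
        "user@example.com",
        "first.last@sub.example.org",
        '"john doe"@example.com',
        "user@[192.168.0.1]",
        "no-at-sign.example.com",
        "a..b@example.com",
        "user@-example.com",
    ]
    for sample in samples:
        print(sample, is_valid(sample))
```

Keep in mind that this only exercises the pattern as Python’s re module interprets it; other engines may need the small adjustments mentioned earlier.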

The Xcode project used to build this Mac app can be found on my GitHub site under the (historical) name EmailAddressFinder.