URL Verifier, Parser, and Scraper in Objective-C

I just updated the second version of my github project, URLFinderAndVerifier. Needing to verify a http URL as it was typed in to toggle an “Accept” button, I searched the web for such a thing, and unfortunately found oodles of them, none of which had any credentials.

Looking closer on another day, I found a site run by Jeff Roberson where he had meticulously worked through RFC-3896, constructing simple regular expressions then combining them to create a full regular expression that should properly parse any valid URL. Note that the standard allows much of what we think of as necessary information to be empty.

So I started with his expressions, then tweaked them slightly to handle more real world conditions, such as forcing a person to enter (the optional) ‘/’ at the start of the path segment (which acts as a end of authority marker).

Using my previous Email verifier test harness, I constructed a test engine that lets you test out various combinations of options. It has three components:

– construct the regular expressions for use in a text file or a NSString constant

– URLSearcher, a class that uses to regular expression to find or verify URLs

– a test app that you use to test various URLs, or “live” input mode

AppScreenShot

All regular expressions exist in text files that are heavily commented. A pre-processor reads these files and removes comments and extraneous characters (like spaces). This makes the files much more readable and understandable. These partial regular expressions that are used to build a full regular expression based on the options you select in the GUI.

Users can optionally enter Unicode characters, which the regular expression can optionally allow. Given a verified URL, a utility function in URLSearcher converts these syntactically correct strings to ‘%’ encoded ones that fully meet the RFC spec (Europeans should like this!)

Other options are to look for every scheme, or http/https, or http/https/ftp and the ability to spec capture groups to see URL subcomponents: user, host, port, path, query, and fragment.

PS: if you end up using this it would be great if you could give the StackOverflow post an up-arrow.