URL Verifier, Parser, and Scraper in Objective-C

I just updated the second version of my github project, URLFinderAndVerifier. Needing to verify a http URL as it was typed in to toggle an “Accept” button, I searched the web for such a thing, and unfortunately found oodles of them, none of which had any credentials.

Looking closer on another day, I found a site run by Jeff Roberson where he had meticulously worked through RFC-3896, constructing simple regular expressions then combining them to create a full regular expression that should properly parse any valid URL. Note that the standard allows much of what we think of as necessary information to be empty.

So I started with his expressions, then tweaked them slightly to handle more real world conditions, such as forcing a person to enter (the optional) ‘/’ at the start of the path segment (which acts as a end of authority marker).

Using my previous Email verifier test harness, I constructed a test engine that lets you test out various combinations of options. It has three components:

– construct the regular expressions for use in a text file or a NSString constant

– URLSearcher, a class that uses to regular expression to find or verify URLs

– a test app that you use to test various URLs, or “live” input mode

AppScreenShot

All regular expressions exist in text files that are heavily commented. A pre-processor reads these files and removes comments and extraneous characters (like spaces). This makes the files much more readable and understandable. These partial regular expressions that are used to build a full regular expression based on the options you select in the GUI.

Users can optionally enter Unicode characters, which the regular expression can optionally allow. Given a verified URL, a utility function in URLSearcher converts these syntactically correct strings to ‘%’ encoded ones that fully meet the RFC spec (Europeans should like this!)

Other options are to look for every scheme, or http/https, or http/https/ftp and the ability to spec capture groups to see URL subcomponents: user, host, port, path, query, and fragment.

PS: if you end up using this it would be great if you could give the StackOverflow post an up-arrow.

Advertisements

3 thoughts on “URL Verifier, Parser, and Scraper in Objective-C

  1. Hi again,

    I’ve discovered an anomaly in the regex that URLSearcher uses. Additional characters are extracted on the end of the URL. For example, from the following text, the regex will extract the trailing “).” along with the URL, giving a result of http://www.apple.com).

    bla bla (http://www.apple.com). Bla bla

    Regards
    Jools

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s