Someone asked for an algorithm to extract email addresses from an arbitrary chunk of text. As it happens, I had stumbled on a site that tests email-validation regular expressions against a huge list of both legal and illegal addresses. It’s truly amazing what strings are in fact legal!
I used what is now the second-best regex in the Lot18 wine app, so I wanted to see whether it could also be applied to a large text file with embedded addresses. It turned out that the regex I had been using did not work for this purpose, but the current (as of 4/1/2013) best performer did.
However, neither regex handles “mailbox” detection as defined in RFC6068. Mailboxes look like “Joe Smith at IBM” <firstname.lastname@example.org>. Since the “name” portion of the mailbox usually contains important information, I wanted to see if I could find a reasonable algorithm to add it to the detector. I did in fact complete that task, and the result is open sourced on GitHub as EmailAddressFinder.
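To make the mailbox format concrete, here is a minimal Python sketch. It is not the code from EmailAddressFinder: the pattern MAILBOX_RE and the function find_mailboxes are names I made up for illustration, the pattern is far simpler than a real address regex, and it assumes the name is wrapped in plain double quotes. The standard library's email.utils.parseaddr does the splitting into name and address.

```python
import re
from email.utils import parseaddr

# Simplified assumption: an optional double-quoted display name followed by
# an address in angle brackets. Real mailboxes allow many more forms.
MAILBOX_RE = re.compile(r'(?:"[^"\r\n]*"\s*)?<[^<>\s@]+@[^<>\s@]+>')

def find_mailboxes(text):
    """Return (display_name, address) pairs for each mailbox-like match."""
    return [parseaddr(match.group(0)) for match in MAILBOX_RE.finditer(text)]

pairs = find_mailboxes('Ask "Joe Smith at IBM" <firstname.lastname@example.org> today.')
```

Each pair keeps the name next to its address, which is the whole point of detecting mailboxes rather than bare addresses.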
But reading all the RFCs led me to yet another email-related issue: recovering “mailto” links within HTML source. In the end this turned out to be a much easier and more elegant task, as the format of mailto links is quite regular. I had a bit of time recently and was able to add a mailto extractor to this project.
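Because mailto links are so regular, a rough extractor fits in a few lines. The Python sketch below is only an illustration, not the extractor in the project: MAILTO_RE and extract_mailto_addresses are hypothetical names, and it assumes every link appears as a quoted href attribute. It drops the query part (subject, body and so on) and keeps only the recipients.

```python
import re
from urllib.parse import unquote, urlsplit

# Simplified assumption: links look like href="mailto:..." or href='mailto:...'.
MAILTO_RE = re.compile(r"""href\s*=\s*(["'])(mailto:[^"']*)\1""", re.IGNORECASE)

def extract_mailto_addresses(html):
    """Return the recipient addresses found in mailto links, in document order."""
    addresses = []
    for match in MAILTO_RE.finditer(html):
        parts = urlsplit(match.group(2))
        # parts.path holds the comma-separated recipients; the query is ignored.
        for addr in parts.path.split(","):
            addr = unquote(addr).strip()  # decode percent-escapes such as %40
            if addr:
                addresses.append(addr)
    return addresses

found = extract_mailto_addresses('<a href="mailto:joe@example.org?subject=Hi">Mail me</a>')
```

A real HTML parser would be more robust against unusual markup, but for well-formed links a pattern like this covers the common cases.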
The only other addition to this project I can think of would be to provide a merged search: first detect the mailto tags and remove them from the original string, then harvest the remaining text for other addresses. Most likely any given string would contain only one or the other, so I’m not doing anything until someone asks for it.
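For what it's worth, the two-pass idea could be sketched roughly like this in Python. None of it exists in the project: merged_search, MAILTO_LINK_RE and ADDRESS_RE are hypothetical names, and ADDRESS_RE is a deliberately naive stand-in for a proper address regex. The sketch removes the whole link element, on the assumption that the link text often repeats the address and would otherwise be counted twice.

```python
import re

# Assumption: a mailto link is a complete <a ...>...</a> element with a quoted href.
MAILTO_LINK_RE = re.compile(
    r"""<a\b[^>]*href\s*=\s*(["'])mailto:([^"'?]*)[^"']*\1[^>]*>.*?</a>""",
    re.IGNORECASE | re.DOTALL,
)
# Naive stand-in for a real address regex; it misses many legal addresses.
ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def merged_search(text):
    """Collect mailto recipients first, then plain addresses in what is left."""
    found = []

    def take_mailto(match):
        # Record the recipients, then blank out the whole link element.
        found.extend(a.strip() for a in match.group(2).split(",") if a.strip())
        return " "

    remaining = MAILTO_LINK_RE.sub(take_mailto, text)
    found.extend(ADDRESS_RE.findall(remaining))
    return found

sample = 'Write to <a href="mailto:joe@example.org?subject=Hi">joe@example.org</a> or ann@example.com.'
result = merged_search(sample)
```

Here the address inside the link is reported once even though it appears twice in the source, and the plain-text address is picked up by the second pass.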
PS: If you end up using this, it would be great if you could give the Stack Overflow post an upvote.