Generate the regex for a TLD hostname from Perl
This was a quick, fun exercise to remind me that I can still write Perl. It fetches the list of TLDs from IANA, does a quick bit of munging, then prints a regex which should match any valid FQDN. The pattern is all lowercase, so apply it case-insensitively:
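#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple;

# Fetch the current TLD list; bail out rather than print a broken regex.
my $content = get('http://data.iana.org/TLD/tlds-alpha-by-domain.txt')
    or die "Could not fetch the TLD list from IANA\n";

# One or more dot-terminated labels, followed by a known TLD.
my $fqdn_regex = '(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:';
# Skip the comment header and the Punycode (xn--) IDNA entries.
$fqdn_regex .= join('|', grep { !/^(?:#|xn--)/ } split /\n/, lc $content);
$fqdn_regex .= ')';

# The hostname must end at whitespace, a slash, or end of input.
my $regex = $fqdn_regex . '(?:\s|\/|$)';
print "$regex\n";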
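To show how the printed pattern might be used, here is a minimal sketch that builds the same regex from a tiny hard-coded TLD sample instead of the IANA download, so it runs offline. The sample list and the helper name contains_fqdn are assumptions made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical four-entry sample; the real script builds this
# alternation from the full IANA list.
my @tlds = qw(com net org uk);

my $fqdn_regex = '(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:'
               . join('|', @tlds)
               . ')';
my $regex = $fqdn_regex . '(?:\s|\/|$)';

# Compile once, case-insensitively, because the pattern is all lowercase.
my $re = qr/$regex/i;

# Illustrative helper (not in the original script): true if the text
# contains something that looks like an FQDN under a known TLD.
sub contains_fqdn {
    my ($text) = @_;
    return $text =~ $re ? 1 : 0;
}

for my $candidate ('www.example.com', 'Example.ORG/path',
                   'example.community', 'localhost') {
    printf "%-20s %s\n", $candidate,
        contains_fqdn($candidate) ? 'match' : 'no match';
}
```

The trailing (?:\s|\/|$) group is what keeps example.community from matching on the strength of its "com" prefix. Note that the pattern is not anchored at the start, so it finds hostnames embedded in longer text rather than validating a whole string.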
Several caveats:
- It matches neither IPv4 dotted-quad addresses nor IPv6 colon notation
- It intentionally ignores Internationalizing Domain Names in Applications (IDNA) TLDs, i.e. the Punycode entries that begin with xn--
- It borrows from my favorite reference for this, regular-expressions.info’s page on email address regexes.
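To make the IDNA caveat concrete, here is a small sketch of the munging step run against a made-up excerpt in the format of the IANA file (a comment header followed by one upper-case TLD per line). The sample content is an assumption for illustration, not real downloaded data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Made-up excerpt in the shape of tlds-alpha-by-domain.txt.
my $content = "# Version 0000000000, sample data\nCOM\nNET\nXN--P1AI\nORG\n";

# Lowercase everything, split into lines, then drop the comment
# header and any Punycode (xn--) entry.
my @tlds = grep { !/^(?:#|xn--)/ } split /\n/, lc $content;

print join('|', @tlds), "\n";   # prints com|net|org
```

Dropping the xn-- filter from the grep would let the Punycode entries into the alternation, which is one possible starting point for matching IDNA hostnames in their ASCII form.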
Maybe I’ll extend it for completeness and/or rewrite it in Ruby someday. Until then, it’ll always be ~/bin/tld_regex for me.
Published
18 August 2009