Categories
Apache Linux Web Site Technologies

Turning Apache into a Redirect Factory

Intro
I’m getting a little more used to Apache. It’s a strange web server with all sorts of bolt-on pieces. The official documentation is horrible so you really need sites like this to explain how to actually do useful things. You needs real, working examples. In this example I’m going to show how to use the mod_rewrite engine of Apache to build a powerful and convenient web server whose sole purpose in life is for all types of redirects. I call it a redirect factory.

Which Redirects Will it Handle
The redirects will be read in from a file with an easy, editable format. So we never have to touch our running web server. We’ll build in support for the types of redirect requests that I have actually encountered. We don’t care what kind of crazy stuff Apache might permit. You’ll pull your hair out trying to understand it all. All redirects I have ever encountered fall into a relatively small handful of use cases. Ordered by most to least common:

  1. host -> new_url
  2. host/uri[Suffix] -> new_fixed_url (this can be a case-sensitive or case-insensitive match to the uri)
  3. host/uri[Suffix] -> new_prefix_uri[Suffix] (also either case-sensitive or not)

So some examples (not the best examples because I don’t manage drj.com or drj.net, but pretend I did):

  1. drj.com/WHATEVER -> http://drjohnstechtalk.com/
  2. www.drj.com -> http://drjohnstechtalk.com/
  3. drj.com/abcPATH/Preserve -> http://drjohnstechtalk.com/abcPATH/Preserve
  4. drj.com/defPATH/Preserve -> http://drjohnstechtalk.com/ghiPATH/Preserve
  5. drj.com/path/with/slash -> http://drjohnstechtalk.com/other/path
  6. drj.com/path/with/prefix -> http://drjohnstechtalk.com/other/path
  7. drj.net/pAtH/whatever -> https://drjohnstechtalk.com/straightpath
  8. drj.net/2pAtH/stuff?hi=there http://drjohnstechtalk.com/2straightpath/stuff?hi=there
  9. my.host -> http://regular-redirect.com/
  10. whatever-host.whatever-domain/whatever-URI -> http://whatever-new-host.whatever-new-domain/whatever-new-URI

All these different cases can be handled with one config file. I’ve named it redirs.txt. It looks like this:

# redirs file
# The default target has to be listed first
defaultTarget   D       http://www.drjohnstechtalk.com/blog/
# hosts with URI-matching grouped together
# available flags: "P" - preserve part after match
#                  "C" - exact case match of URI
 
# Begin host: drj.com:www.drj.com - ":"-separated list of applicable hostnames
/                       http://drjohnstechtalk.com/
/abc    P       http://drjohnstechtalk.com/abc
/def    P       http://drjohnstechtalk.com/ghi
/path/with/slash https://drjohnstechtalk.com/other/path
/path/with/prefix P  https://drjohnstechtalk.com/other/path
# end host drj.com:www.drj.com
 
# this syntax - host/URI - is also OK...
drj.net/ter             http://drjohnstechtalk.com/terminalredirect
drj.net/pAtH    C       http://drjohnstechtalk.com/straightpath
drj.net/2pAtH   CP      http://drjohnstechtalk.com/2straightpath
 
# hosts with only host-name matching
my.host                 http://regular-redirect.com/
www.drj.edu             http://education-redirect.edu/edu-path

The Apache configuration file piece is this:

# I really don't think this does anything other than chase away a scary warning in the error log...
RewriteLock ${APACHE_LOCK_DIR}/rewrite_lock
<VirtualHost *:90>
# Inspired by the dreadful documentation on http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html
RewriteEngine on
RewriteMap  redirectMap prg:conf/vhosts/redirect.pl
#RewriteCond ${lowercase:%{HTTP_HOST}} ^(.+)$
RewriteCond ${redirectMap:%{HTTP_HOST}%{REQUEST_URI}} ^(.+)$
# %N are backreferences to RewriteCond matches, and $N are backreferences to RewriteRule matches
RewriteRule ^/.* %1 [R=301,L]
</VirtualHost>

Remember I split up apache configuration into smaller files. So that’s why you don’t see the lines about logging and what port to listen on, etc. And the APACHE_LOCK_DIR is an environment variable I set up elsewhere. This file is called redirect.conf and is in my conf/vhosts directory.

In my main httpd.conf file I extended the logging to prefix the lines in the access log with the host name (since this redirect server handles many host names this is the only way to get an idea of which hosts are popular):

...
    LogFormat "%{Host}i %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
...

So a typical log line looks something like the following:

drj.com 201.212.205.11 - - [10/Feb/2012:09:09:07 -0500] "GET /abc HTTP/1.1" 301 238 "http://www.google.com.br/url?sa=t&rct=j&q=drjsearch" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C; .NET4.0E)"

I had to re-compile apache because originally my version did not have mod_rewrite compiled in. My description of compiling Apache with this module is here.

The directives themselves I figured out based on the lousy documentation at their official site: http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html. The heavy lifting is done in the Perl script because there you have some freedom (yeah!) and are not constrained to understand all their silly flags. One trick that does not seem documented is that you can send the full URL to your mapping program. Note the %{HTTP_HOST}%{REQUEST_URI} after the “:”.

I tried to keep redirect.pl brief and simple. Considering the many different cases it isn’t too bad. It weighs in at 70 lines. Here it is:

#!/usr/bin/perl
# Copyright work under the Artistic License, http://www.opensource.org/licenses/Artistic-2.0
# input is $HTTP_HOST$REQUEST_URI
$redirs = "redirs.txt";
# here I only want the actual script name
$working_directory = $script_name = $0;
$script_name =~ s/.*\///g;
$working_directory =~ s/\/$script_name$//g;
$finalType = "";
$DEBUG = 0;
$|=1;
while (<STDIN>) {
  chomp;
  ($host,$uri) = /^([^\/]+)\/(.*)/;
  $host = lc $host;
# use generic redirect file
  open(REDIRS,"$working_directory/$redirs") || die "Cannot open redirs file $redirs!!\n";
  $lenmatchmax = -1;
  while(<REDIRS>) {
# look for alternate names section
    if (/#\s*Begin host\s*:\s*(\S+)/i) {
      @hostnames = split /:/,$1;
      $pathsection = 1;
    } elsif (/#\s*End host/i) {
      $pathsection = 0;
    }
    @hostnames = () unless $pathsection;
    next if /^#/ || /^\s*$/; # ignore comments and blank lines
    chomp;
    $type = "";
# take out trailing spaces after the target URL
    s/\s+$//;
    if (/^(\S+)\s+(\S{1,2})\s+(\S+)$/) {
      ($redirsURL,$type,$targetURL) = ($1,$2,$3);
    } else {
       ($redirsURL,$targetURL) = /^(\S+)\s+(\S+)$/;
    }
# set default target if specified. It has to come at beginning of file
    $finalURL = $targetURL if $type =~ /D/;
    $redirsHost = $redirsURI = $redirsURIesc = "";
    ($redirsHost,$redirsURI) = $redirsURL =~ /^([^\/]*)\/?(.*)/;
    $redirsURIesc = $redirsURI;
    $redirsURIesc =~ s/([\/\?\.])/\\$1/g;
    print "redirsHost,redirsURI,redirsURIesc,targetURL,type: $redirsHost,$redirsURI,$redirsURIesc,$targetURL,$type\n" if $DEBUG;
    push @hostnames,$redirsHost unless $pathsection;
    foreach $redirsHost (@hostnames) {
    if ($host eq $redirsHost) {
# assume case-insensitive match by default.  Use type of 'C' to demand exact case match
# also note this matches even if uri and redirsURI are both empty
      if ($uri =~ /^$redirsURIesc/ || ($type !~ /C/ && $uri =~ /^$redirsURIesc/i)) {
# find longest match
        $lenmatch = length($redirsURI);
        if ($lenmatch > $lenmatchmax) {
          $finalURL = $targetURL;
          $finalType = $type;
          $lenmatchmax = $lenmatch;
          if ($type =~ /P/) {
# prefix redirect
            if ($uri =~ /^$redirsURIesc(.+)/ || ($type !~ /C/ && $uri =~ /^$redirsURIesc(.+)/i)) {
              $finalURL .= $1;
             }
          }
        }
      }
    } # end condition over input host matching host from redirs file
    } # end loop over hostnames list
  } # end loop over lines in redirs file
  close(REDIRS);
# non-prefix re-direct. This is bizarre, but you have to end URI with "?" to kill off the query string, unless the target already contains a "?", in which case you must NOT add it! Gotta love Apache...
  $finalURL .= '?' unless $finalType =~ /P/ || $finalURL =~ /\?/;
  print "$finalURL\n";
} # end loop over STDIN

The nice thing here is that there are a couple of ways to test it, which gives you a sort of cross-check capability. Of course I made lots of mistakes in programming it, but I worked through all the cases until they were all right, using rapid testing.

For instance, let’s see what happens for www.drj.com. We run this test from the development server as follows:

> curl -i -H ‘Host: www.drj.com’ ‘localhost:90’

HTTP/1.1 301 Moved Permanently
Date: Thu, 09 Feb 2012 15:24:25 GMT
Server: Apache/2
Location: http://drjohnstechtalk.com/
Content-Length: 235
Content-Type: text/html; charset=iso-8859-1
 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://drjohnstechtalk.com/">here</a>.</p></body></html>

And from the command line I test redirect.pl as follows:

> echo “www.drj.com/”|./redirect.pl

http://drjohnstechtalk.com/?

That terminal “?” is unfortunate, but apparently you need it to kill off any possible query_string.

You want some more? OK. How about matching a host and the initial path in a case-insensitive manner? No problem, we’re up to the challenge:

> curl -i -H ‘Host: DRJ.COM’ ‘localhost:90/PATH/WITH/SLASH/stuff?hi=there’

HTTP/1.1 301 Moved Permanently
Date: Thu, 09 Feb 2012 15:38:12 GMT
Server: Apache/2
Location: https://drjohnstechtalk.com/other/path
Content-Length: 246
Content-Type: text/html; charset=iso-8859-1
 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://drjohnstechtalk.com/other/path">here</a>.</p></body></html>

Refer back to the redirs file and you see this is the desired behaviour.

We could go on with an example for each case, but we’ll conclude with one last one:

> curl -i -H ‘Host: DRJ.NET’ ‘localhost:90/2pAtHstuff?hi=there’

HTTP/1.1 301 Moved Permanently
Date: Thu, 09 Feb 2012 15:44:37 GMT
Server: Apache/2
Location: http://drjohnstechtalk.com/2straightpathstuff?hi=there
Content-Length: 262
Content-Type: text/html; charset=iso-8859-1
 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://drjohnstechtalk.com/2straightpathstuff?hi=there">here</a>.</p>
</body></html>

A case-sensitive, preserve match. Change “pAtH” to “path” and there is no matching line in redirs.txt so you will get the default URL.

Creating exceptions

Eventually I wanted to have an exception – a URI which should be served with a 200 status rather than redirected. How to handle?

# Inspired by the dreadful documentation on http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html
        RewriteEngine on
# just this one page should NOT be redirected
        Rewriterule ^/dontredirectThisPage.php - [L]
        RewriteMap  redirectMap prg:redirect.pl
        ... etc ...

The above apache configuration snippet shows that I had to put the page which shouldn’t be redirected at the top of the ruleset and set the target to “-“, which turns off redirection for that match, and make this the last executed Rewrite rule. I think this is better than a negated match (!) which always gets complicated.

Conclusion
A powerful redirect factory was constructed from Apache and Perl. We suffered quite a bit during development because of incomprehensible documentation. But hopefully we’ve saved someone else this travail.

References and related
This post describes how to massage Apache so that it always returns a maintenance page no matter what URI was originally requested.
I have since learned that another term used in the industry for rediect server is persistent URL (PURL). It’s explained in Wikipedia by this article: https://en.wikipedia.org/wiki/Persistent_uniform_resource_locator

Leave a Reply

Your email address will not be published. Required fields are marked *