The Daily Dump

May 05, 2005

Blogging | Movable Type

Cruft-Free, Future-Proof URLs

I recently embarked on a crusade to rid my Movable Type blog of all things cruft. Below is a bit of a spiel on what cruft is, followed by the process I went through to decruftify Movable Type.

What is Cruft?

Cruft can be a lot of things. Crufty can be how you feel after a night of enthusiastic inebriation. Cruft can also be what you hear when a New Zealander tries to say "craft". Then there's the cruft you find in URLs. What is URL cruft? Well, it can be loosely defined as anything in a URL that isn't really necessary. Taken further, it can be defined as anything that detracts from the usability of a URL. URLs need to be usable -- users do pay attention to them, as do search engines. They provide a vital clue as to where in a site a user is and, to both users and search engines, what kind of information the current page displays.

Jakob Nielsen has written quite a bit on the usability of URLs. His main points are:

  1. A URL should be based on a domain name that is easy to remember and spell.
  2. A URL should be short and easy to type.
  3. A URL should expose the structure of the site and the information it contains.
  4. A URL should allow the user to move to higher levels of information by hacking off parts of the URL.
  5. A URL should be permanent (to avoid link-rot).

The last point deals with future-proofing, the aspect of cruft removal we're mainly interested in. For a URL to be truly permanent it needs to be application-independent. Better yet, platform-independent. Your blogging application might one day become obsolete, or you might simply find a better or cheaper one. The new blogging application could potentially run on any platform. Who knows what you might migrate to -- whatever it is, you don't want your old URLs to break.

This means anything in the URL that is application-dependent, like ids, or platform-dependent, like ".html" or ".php" file extensions, is cruft and should be removed to make the URL future-proof and permanent.

The Possibilities

There is some well-known prior art on Movable Type cruft removal, most notably by Mark Pilgrim and Már Örlygsson. Mark's approach is to remove the file extension from the URL, so a URL like "2005/05/05/cruft-free-future-proof-urls.html" becomes "2005/05/05/cruft-free-future-proof-urls". The trick is to tell the web server to serve extensionless files as HTML, or to map them to a specific engine (like PHP, ASP, ASP.NET, etc.). Unfortunately, this kind of file mapping isn't always possible on a hosted IIS server -- there are no Apache-style ".htaccess" files to save the day. That makes Már's directory-based approach more appropriate for me.

Már's approach has no file names at all, let alone file extensions. A URL like "2005/05/05/cruft-free-future-proof-urls.html" becomes "2005/05/05/cruft-free-future-proof-urls/". The beauty of a directory-based URL like this is that it asks the web server to serve up the first default file it can find in that directory. Default files can be configured to be anything. Right now mine are set to "index.htm" and "index.html", but I could easily append "Default.aspx" sometime in the future if I migrate to an ASP.NET blogging application. Directory-based URLs are probably the most platform-agnostic URL strategy possible, and consequently, in my opinion, the most future-proof.
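To make the default-file lookup a bit more concrete, here is a minimal Python sketch of how a web server might pick a file for a directory-based URL. The function name `resolve_default_document` and the sample directory listings are hypothetical, made up for illustration; real servers like IIS and Apache do this internally based on their own configuration.

```python
from typing import Iterable, Optional


def resolve_default_document(files_in_directory: Iterable[str],
                             default_documents: Iterable[str]) -> Optional[str]:
    """Return the first configured default document found in the directory.

    Hypothetical helper: it only imitates the idea of a default-document
    list, it is not how any particular server is implemented.
    """
    available = set(files_in_directory)
    for candidate in default_documents:  # list order is the priority order
        if candidate in available:
            return candidate
    return None  # no default document present in this directory


# Today: static HTML published by the blogging application.
defaults = ["index.htm", "index.html"]
print(resolve_default_document(["index.html"], defaults))

# After an imagined migration: a new default is appended, the URL is unchanged.
defaults.append("Default.aspx")
print(resolve_default_document(["Default.aspx"], defaults))
```

The point of the sketch is that the URL never names the file, so what sits in the directory can change without visitors or search engines noticing.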

So enough of the rambling, let's get on with outlining the URL strategy for this blog.

My Blog URL Strategy

I'm using a Preferred Archive Type (in Weblog Config) of Individual, so the permalinks for my entries go something like this...

www.adamboddington.com/blog/2005/05/05/cruft-free-future-proof-urls/

Short? No, it isn't, mostly because I've included the entry's keywords (or, if there are none, its title) in the URL. I've done this for search engine reasons, which I described in my previous post, "Optimising Movable Type for Google". The downside of putting stuff like this in the URL is that I can't really change my keywords or title without breaking the entry permalink. I'm not too worried about that though -- if I need to do it in the future, I can set up some kind of redirect for the affected entry.

The URL is hackable though. The user can get to higher levels of information simply by removing directories from the URL. My daily archives go something like this...

www.adamboddington.com/blog/2005/05/05/

My monthly archives go something like this...

www.adamboddington.com/blog/2005/05/

My yearly archives go something like this...

www.adamboddington.com/blog/2005/

The yearly archive is simply a twelve month calendar with links to all the months and days with entries. Once I get it working correctly I'll blog how I did it.
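The "hackable" property can be sketched in a few lines of Python: start from an entry permalink and remove one trailing directory at a time. The helper name `parent_urls` is hypothetical; it only illustrates that every level of the URL above corresponds to a real archive page.

```python
from typing import List


def parent_urls(url: str, stop_at: str) -> List[str]:
    """List the URLs reached by removing one trailing directory at a time.

    `stop_at` is the highest level to climb to (here, the blog root).
    """
    current = url.rstrip("/")
    root = stop_at.rstrip("/")
    parents = []
    while current != root and current.startswith(root + "/"):
        current = current.rsplit("/", 1)[0]  # drop the last path segment
        parents.append(current + "/")
    return parents


entry = "www.adamboddington.com/blog/2005/05/05/cruft-free-future-proof-urls/"
for level in parent_urls(entry, "www.adamboddington.com/blog/"):
    print(level)  # daily, monthly, yearly, then the blog root
```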

Finally, my category archives go something like this...

www.adamboddington.com/blog/movable-type/

Note that I've included a "blog" directory in my URL. I've done this simply because I foresee a time when I'll use www.adamboddington.com for something more than just a blog. When that happens, the "blog" directory will become the main page for the blog itself. Until that time though, I'll keep showing blog entries on www.adamboddington.com and use the "blog" directory as my archive index page.

And that's the URL strategy for my blog. The remaining sections detail the process I used to achieve it in Movable Type, mostly adapted from Már's approach.

Update Archive URLs

The first thing to do is change the archive URLs from their ".html" extensions to "/index.html". This moves all the entry-specific information out of the file name and into the directory, which will contain only a single "index.html" file.

In the blog administration screen click on Weblog Config, and then on Archive Files. Update the Archive File Template value for each archive type. My values go something like this...

  • Individual
    <$MTArchiveDate format="%Y/%m/%d"$>/<MTIfNotEmpty var="EntryKeywords"><$MTEntryKeywords dashify="1"$><MTElse><$MTEntryTitle dashify="1"$></MTElse></MTIfNotEmpty>/index.html
  • Daily
    <$MTArchiveDate format="%Y/%m/%d"$>/index.html
  • Monthly
    <$MTArchiveDate format="%Y/%m"$>/index.html
  • Yearly (set up as a monthly archive)
    <$MTArchiveDate format="%Y"$>/index.html
  • Category
    <$MTArchiveCategory dashify="1"$>/index.html

Whatever your URL strategy is, the important thing is to move all the Movable Type tags into the directory and leave none in the file name. Use whatever tags you like: title, keywords, or even just the date, as Már does. It's a good idea, however, to avoid using any kind of unique id for the reasons discussed above. Unique ids are hard to keep the same in the event of a migration and will end up breaking entry permalinks. Feel free to switch "index.html" with "index.php" or whatever default file type you're using.
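As a rough illustration of what the Individual template above produces, here is a Python sketch that builds the same kind of path. It is not Movable Type code. The `dashify` function is an assumption about what the filter does (lowercase, runs of non-alphanumeric characters collapsed into a single dash); the real filter's exact rules may differ, especially for accented characters. `individual_archive_path` is a hypothetical name.

```python
import re
from datetime import date


def dashify(text: str) -> str:
    """ASSUMED behaviour: lowercase, non-alphanumeric runs become one dash."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def individual_archive_path(entry_date: date, title: str, keywords: str = "") -> str:
    """Mirror the Individual template: date directories, then keywords or title."""
    slug_source = keywords if keywords else title  # MTIfNotEmpty / MTElse
    return f"{entry_date:%Y/%m/%d}/{dashify(slug_source)}/index.html"


print(individual_archive_path(date(2005, 5, 5), "Cruft-Free, Future-Proof URLs"))
```

Everything that identifies the entry ends up in the directory names, and the file name is always the same default document.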

Update Archive Links

Whenever Movable Type renders an archive link it will include "index.html" or whatever file name we specified above. The file name needs to be removed.

To do this we're going to use regular expressions, so the first step is to download and install Brad Choate's Regex plugin. The installation instructions that come with the plugin are pretty clear. Once the plugin is installed, place the following regex definition at the top of every template that contains archive links.

<MTAddRegex name="removeFileName">s|/index\.html$|/|g</MTAddRegex>

This regular expression looks for "/index.html" at the end of a URL. If it finds "/index.html", it will replace it with "/". Applied to our archive links, "2005/05/05/cruft-free-future-proof-urls/index.html" becomes "2005/05/05/cruft-free-future-proof-urls/". Exactly what we're after.

(Feel free to replace "index" and "html" in the regex definition with whatever default file type you're using. For example, if you're using "Default.aspx", your regular expression should be "s|/Default\.aspx$|/|g".)
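If you want to check the substitution outside of Movable Type before republishing, the same pattern can be tried in a few lines of Python. This is only a sketch of what the regex does, not part of the plugin; the function name `remove_file_name` and its `default_file` parameter are made up for the example.

```python
import re


def remove_file_name(url: str, default_file: str = "index.html") -> str:
    """Replace a trailing "/<default_file>" with "/", leaving other URLs alone."""
    # re.escape makes the "." in the file name match a literal dot,
    # like the "\." in the Movable Type regex definition.
    pattern = "/" + re.escape(default_file) + "$"
    return re.sub(pattern, "/", url)


print(remove_file_name("2005/05/05/cruft-free-future-proof-urls/index.html"))
print(remove_file_name("blog/Default.aspx", default_file="Default.aspx"))
print(remove_file_name("2005/05/"))  # nothing to strip, returned unchanged
```

Because the pattern is anchored to the end of the URL and requires the leading slash, a file that merely ends in "index.html" (say, "myindex.html") is left untouched.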

To apply the regex to the archive links, use the global tag attribute "regex" wherever <$MTArchiveLink$> and <$MTEntryLink$> are used.

<$MTArchiveLink regex="removeFileName"$>
<$MTEntryLink regex="removeFileName"$>

Don't worry about <$MTEntryPermalink$>. We'll deal with that when we fix the trackback pings.

Fix Trackback Pings

Whenever you ping another blog from one of your entries, Movable Type sends the permalink for your entry as part of the ping. By default the permalink will include a file name, which isn't what we want. To fix this we're going to modify the Movable Type source code so that it builds permalinks without the file name.

Look for "/lib/MT/Entry.pm" in the Movable Type installation folder. Open it up in a text editor and search for "sub archive_url". Locate the following line at the end...

$url . $entry->archive_file(@_);

Replace it with this...

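$url .= $entry->archive_file(@_);   # append the archive file path to the URL
$url =~ s|/index\.html$|/|g;        # strip the trailing default file name
$url;                               # the last expression is the return value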

(Note, if you've changed the regex definition used above, make sure you change it here as well. Everything after "=~" and before ";" on the second line is the regular expression.)

Now when Movable Type sends out an entry permalink as part of a trackback ping, it will do so without a file name. And as a pleasant side effect, the <$MTEntryPermalink$> tag will also show the correct URL without having to use regex. Fantastic.

By the way, if you read up on Már's approach, you will have noticed he also modified the "permalink" subroutine. This is only necessary if you're using a Preferred Archive Type (in Weblog Config) of Daily, Monthly or Category. We're using Individual, as outlined in the blog URL strategy, so we can skip this change.

And that's all I've done to get my Movable Type URLs cruft-free and future-proof. Már has done a bit more to make his comment permalinks future-proof as well, but I haven't attempted this as yet. Maybe sometime in the future I'll get around to it.

Posted by Adam Boddington at 09:43 AM | Comments (0)
