Published: 2021/04/01
Last updated: 2021/04/24
Talking to some friends, I’ve come to realise that the way I put together this site is a bit unusual, so I figured it might be fun to explain my procedures and rationale in the hope that someone might find it useful, or at least interesting.
First off, I’ll describe the process I use, and then show the scripts in full for anyone interested in further details.
The way I write articles is a bit less home-spun than that of a lot of people here. Instead of firing up an IDE (or using the editor on Neocities), I open up Vim and write what I want in plain text, utilising Markdown syntax. I then process this text using Pandoc, a conversion tool, to go from Markdown to a polyglot XHTML/HTML5 output. Pandoc has its own Markdown extensions that I sometimes utilise, especially the syntax that allows a level 1 heading to do double duty as a <title> element, e.g.
% How This Site is Built
This creates both the header you see at the top of this page as well as the title you see in your Web browser’s title bar.
Why introduce this layer of abstraction?
It’s a matter of design philosophy. I prefer markup systems that separate content from presentation so that I can focus on the former with minimal distraction from the latter. In systems that combine both, such as word processors, you have to deal with both at the same time. This leads to mistakes such as failing to notice that Bold was still turned on when you typed the rest of your sentence (because the marker for it is invisible) or, in the case of HTML, forgetting to close tags, or mistyping them.
This methodology also makes it easier to alter articles at a later time without having to be concerned with accidentally breaking markup in the process, and makes it easier to see what changed in the actual content when reviewing differentials (which I’ll discuss a bit more later).
Perhaps most usefully, the minimal, natural-feeling syntax inherent to Markdown also means that the source of the article is massively easier to read and edit as a human. I can even send people copies of the markdown document to proof-read and criticise without requiring them to do or know anything special and while still keeping a reasonably clean appearance.
Now, with an article written (or its appearance needing testing), the next step is to render it. As stated previously, pandoc handles the initial transformation, but that’s not all that goes into it. I wrote a shell script that accepts a markdown file as input, runs it through pandoc, and then applies several “patches” to the rendered HTML, some conditional and some unconditional. These patches do things such as adding a table of contents if a specific placeholder line is found, replacing pandoc’s in-line styling with a link to my CSS file, and adding global elements, such as the “back” links at the bottom of every page. The output is always a single, well-structured HTML file.
Why take this more programmatic approach?
It allows me a lot of flexibility – if I feel the need to re-arrange the structure of the web site, I can do so with minimal fuss. To make that easy, I even have a Makefile that finds every markdown file in my directory and runs each through the shell script in turn.
Sharp-eyed observers may have noted that every internal link on this site is relative instead of absolute. In order to facilitate this, my shell script keeps track of important elements such as the relative locations of the index page, the page being rendered, and the style sheet. The reasoning for this is simple: it makes it a lot easier to do local testing by avoiding the need to have a local web server running – I can test the entire site from top to bottom without actually putting it online. I also avoid the need to search-and-replace in every file if I end up having to change hosts for some reason.
Another benefit centres on the CSS template. By having a single, global CSS file linked from every page, I can create something that will work in 99% of cases and effect changes by merely editing one file and rebuilding, instead of having to alter dozens of files. Then, should I require any specific overrides, I can include them in-line in the specific document that requires them, as Markdown allows for raw HTML where needed.
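As an aside, an in-line override of that sort might look something like this; the styling and text here are purely hypothetical, for illustration only:

```markdown
Regular Markdown text continues as normal above and below.

<!-- A one-off override for this document only; raw HTML between blank
     lines passes through pandoc untouched -->
<p style="color: #888;">This one paragraph is rendered in grey.</p>
```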
It’s probably obvious by now that I generate my RSS feed programmatically as well. After updating the front page with a link to the new article and rendering all of my pages, I run a Perl script I wrote that parses my index page, pulls out the local links, finds the page each refers to, scans it for the “Last updated” line, parses that date, sanitises the title and description (substituting characters that might break the feed, such as reserved HTML symbols like quotes or ampersands), sorts the entries by date, cuts out old ones, and then generates the feed’s XML file.
I’d rather put forth a few hours of effort up-front and never have to worry about doing boring busywork afterwards. I suppose I could say that about the entire process, really, but that was my chief motivation in this case.
The final thing I do before uploading the pages is to commit all of my changes around a given article to a local Git repository. This process is entirely manual, as it allows me to write careful notes on precisely what I changed and why.
Truthfully, I’m absolutely amazed that there are people not doing this. Git is a wonderful system, and it is just as suited to prose as it is to programming. The basic commands can be learned in minutes, and it has saved me a lot of time and trouble over the years.
For the uninitiated, Git is a version control system (VCS) created by Linus Torvalds, better known as the creator of the Linux kernel. It allows you to track and commit changes to arbitrary files in such a way as to create a precise history of how, when, and by whom things were changed. More simply, you can think of it as very similar to creating (annotated) save states in an emulator or virtual machine. One of the chief benefits is that it lets you experiment without having to worry about massively breaking everything created up to that point.
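For the sceptical, the entire prose workflow fits into a handful of commands. This is a generic sketch rather than my actual repository; the file names and commit messages are made up:

```shell
# Set up a fresh repository for the site (one-time)
mkdir site && cd site
git init -q
git config user.name "Example Author"      # Placeholder identity for the demo
git config user.email "author@example.com"

# First draft: track the file and commit it with a note explaining the change
printf '%% A New Article\n' > article.md
git add article.md
git commit -q -m "Add first draft of article"

# Revise the prose, then record exactly what changed and why
printf 'Some body text.\n' >> article.md
git commit -q -am "Flesh out the introduction"

# The annotated history: each "save state" along with its note
git log --oneline article.md
```

Each commit is one of those annotated save states; `git log` replays them newest-first.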
To recap, my general workflow is as follows:

1. Write the article in Markdown using Vim.
2. Render it to HTML with the build script.
3. Update the front page with a link to the new article and rebuild everything with the Makefile.
4. Regenerate the RSS feed with the Perl script.
5. Commit the changes to the local Git repository, with careful notes.
6. Upload the pages.
By having such a programmatic approach, I am afforded the luxury of not having to care overmuch about presentation, the ability to create simple overrides when I do need to care, the ability to easily read and edit the source material, the ability to re-structure with minimal fuss, the ability to quickly create an accurate and self-cleaning RSS feed, and a powerful revision system.
Now, let’s look at the scripts used at the time this article was written.
First off, the Makefile:
.PHONY: all
all:
	for i in `find . -name "*.md"`; do ./build.sh "$$i"; done
	./rss.pl

.PHONY: clean
clean:
	for i in `find . -name "*.md"`; do rm $$(dirname $$i)/$$(basename $$i .md).html; done
	rm rss.xml
Very straightforward: both targets are simple invocations of find(1) followed by an action. The slightly more complex cleaning action removes any HTML file that has an identically-named markdown file alongside it.
Next, the main build script:
#!/bin/sh
# Tools needed:
# - dirname
# - grep
# - pandoc
# - realpath
# - sed
# Assumptions:
# - index.md is named as such and in the same directory as this script.
# - CSS is global and located in css/style.css.
# - All documents requiring a TOC have a placeholder with "<!-- TOC -->" in
# them.
# - Pandoc is v2.11.4 (Earlier or later versions may have different template
# text)
# Import the argument and get paths to important elements
i="$1"
BASEPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 || exit 2 ; pwd -P )"
INDEX="$BASEPATH/index.html";
INDEXPATH=$(realpath --relative-to "$(dirname "$i")" "${INDEX}")
STYLE="$BASEPATH/css/style.css"
STYLEPATH=$(realpath --relative-to "$(dirname "$i")" "${STYLE}")
MASCOT="$BASEPATH/img/momiji.png"
MASCOTPATH=$(realpath --relative-to "$(dirname "$i")" "${MASCOT}")
MASCOTALT='I came here to awoo at you'
RSS="$BASEPATH/rss.xml"
RSSPATH=$(realpath --relative-to "$(dirname "$i")" "${RSS}")
PATCHES="$(dirname "$i")/patches.d"
# Determine if a TOC is needed; if so, render the page with one, relocate it to
# below the placeholder, removing that in the process...
TOC=$(grep -c '<!-- TOC -->' "$i")
if [ "$TOC" -eq 1 ]; then
    pandoc -V lang=en --toc -s -c "$STYLEPATH" "$i" |
        sed \
            -e '/<nav id="TOC" role="doc-toc">/,/<\/nav>/{H;d}' \
            -e '/<!-- TOC -->/{g; s/$/\n/}' \
            > "$(dirname "$i")"/"$(basename "$i" .md)".html
# ...otherwise, just render the page normally.
else
    pandoc -V lang=en -s -c "$STYLEPATH" "$i" > "$(dirname "$i")"/"$(basename "$i" .md)".html
fi
# Remove the inline styling, the IE9 block, and add the mascot image
sed -i \
    -e '/<style>/,/<\/style>/d' \
    -e '/<!--\[if lt IE 9\]>/,/<!\[endif\]-->/d' \
    -e 's|</body>|\n<div id="mascot">\n<p><img src="'"$MASCOTPATH"'" alt="'"$MASCOTALT"'" /></p>\n</div>\n&|' \
    "$(dirname "$i")"/"$(basename "$i" .md)".html
# Add the "back" link to everything except the index
if [ "$(printf "%s" "$i" | grep -Ec '^(./)?index.md$')" -eq 0 ]; then
    sed -i -e 's|</body>|\n<p><a href="'"$INDEXPATH"'">Back to main page</a></p>\n&|' "$(dirname "$i")"/"$(basename "$i" .md)".html
fi
# Add the RSS link to the index only
if [ "$(printf "%s" "$i" | grep -Ec '^(./)?index.md$')" -eq 1 ]; then
    sed -i -e 's|</title>|&\n <link href="'"$RSSPATH"'" rel="alternate" type="application/rss+xml" title="What'\''s new on ShadowM00n'\''s Summit" />|' \
        "$(dirname "$i")"/"$(basename "$i" .md)".html
fi
# Find and invoke any patches named after the article
if [ -d "$PATCHES" ]; then
    find "$PATCHES" -name "$(basename "$i" .md).*" -type f -executable -exec {} "$i" \;
fi
I think the commentary is largely sufficient for this, but there are a few odd things in here.
First, note the various paths and how they’re determined via a clever mix of realpath(1) and dirname(1). This is how I’m able to keep all of my important links relative.
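To illustrate, here is a stripped-down version of that path computation, using hypothetical directory names rather than my real layout:

```shell
# A hypothetical site layout: the stylesheet lives at the root, the article
# two levels down
mkdir -p articles/2021 css
touch css/style.css
i="articles/2021/example.md"

# Ask GNU realpath: "from the article's directory, how do I reach the CSS?"
STYLEPATH=$(realpath --relative-to "$(dirname "$i")" css/style.css)
echo "$STYLEPATH"    # → ../../css/style.css
```

Because the result is relative, the link keeps working whether the page is opened from a local directory or served from any host.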
Next, note that I have to use a different pandoc(1) invocation based on whether or not the specific placeholder is present in the text. When rendering a TOC, pandoc likes to place it at the very top of the document, so I use some sed(1) scripting to first slurp the TOC, start to finish, into sed’s hold space, deleting it as it goes; when sed then reaches my marker, it fetches the saved TOC from the hold space, replaces the marker with it, and adds a newline at the end for tidiness.
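The hold-space trick is easier to see on a toy input. The following is a simplified stand-in for pandoc’s output, not the real thing:

```shell
# A toy page: pandoc-style TOC at the top, placeholder further down
cat > page.html <<'EOF'
<nav id="TOC" role="doc-toc">
<li>Section one</li>
</nav>
<h1>Title</h1>
<!-- TOC -->
<p>Body</p>
EOF

# Same two expressions as the build script: the address range appends
# ("H") each TOC line to the hold space and deletes it from the output;
# at the marker, "g" swaps the saved TOC in, replacing the marker itself
sed \
    -e '/<nav id="TOC" role="doc-toc">/,/<\/nav>/{H;d}' \
    -e '/<!-- TOC -->/{g; s/$/\n/}' \
    page.html
```

The `<nav>` block now appears below the `<h1>` where the marker used to be, and the marker itself is gone.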
Regardless of which invocation was run, I then remove some unneeded material inserted by pandoc, and then insert Momiji. Note especially the mess of quotation marks I have to use in order to interpolate text in this format.
Then, a simple conditional invocation that runs on every page except the index in order to add a relative link back to the index.
Next, the inverse – on the index page only, add the RSS feed’s advertisement to the header. This is distinct from the RSS link at the bottom of the index page and is not strictly necessary, but it is a courteous thing to do. This is what lets compatible Web browsers offer to subscribe to the RSS feed.
Finally, see if there’s a directory of local patches, and execute all executable files therein matching the article name, passing in the file name for convenience. This is mainly used for organisational purposes, taking inspiration from /etc/foo.d/ directories in GNU/Linux systems. That is to say, I can write patches for individual articles without having to bloat up my main build script.
Lastly, the script for the RSS feed:
#!/usr/bin/env perl
use strict;
use warnings;
use re '/xms';
our $VERSION = '1.0.0';
# Modules
use Carp; # Core
use English qw(-no_match_vars); # Core
use DateTime; # CPAN
use POSIX qw(strftime); # Core
my $input = 'index.html';
my $output = 'rss.xml';
my $base_url = 'https://shadowm00n.neocities.org';
my $feed_items;
my $MAX_FEED_ITEMS = 20;
# Extracts the publishing date for the RSS feed by looking for the "Last
# updated" note in both RFC 2822 and UNIX epoch formats
sub extract_pub_date {
    my $file = shift;
    my ( $year, $month, $day );
    open my $fh, '<', $file or die "Unable to open $file: $ERRNO\n";
    while (<$fh>) {
        ## no critic [RegularExpressions::ProhibitUnusedCapture] # False alarm
        if (/Last[ ]updated:[ ](\d+ \/ \d+ \/ \d+)/) {
            ( $year, $month, $day ) = split /\//, $1;
            last;
        }
    }
    close $fh;
    my $dt = DateTime->new(
        year       => $year,
        month      => $month,
        day        => $day,
        hour       => 0,
        minute     => 0,
        second     => 0,
        nanosecond => 0,
        time_zone  => 'GMT'    # For privacy
    );
    my $pub_date   = $dt->strftime('%a, %d %b %Y %H:%M:%S %z');
    my $epoch_date = $dt->strftime('%s');
    return $pub_date, $epoch_date;
}
# Replaces ampersands, quotes, and angle brackets with HTML-friendly versions
sub sanitise_html {
    my $string = shift;

    # Ampersands must be escaped first, or the entities added below would
    # themselves be mangled
    $string =~ s/&/&amp;/g;
    $string =~ s/'/&#39;/g;
    $string =~ s/"/&quot;/g;
    $string =~ s/</&lt;/g;
    $string =~ s/>/&gt;/g;
    return $string;
}
## Main
# Read in the index file and scoop up everything in <body>
open my $fh, '<', $input or die "Unable to open $input: $ERRNO\n";
my ($body) = (
do { local $INPUT_RECORD_SEPARATOR = undef; <$fh> }
) =~ /<body>(.*)<\/body>/;
close $fh;
# Gather all the lines that don't have http/https in them and parse out the
# info needed for the RSS feed
for ( split /\n/, $body ) {
    if (/<a[ ]href="( (?!http|https) .*?)">(.*?)<\/a>[ ](.*)<\/.*>$/) {
        ( $feed_items->{$1}{pub_date}, $feed_items->{$1}{epoch_date} )
            = extract_pub_date($1);
        $feed_items->{$1}{url}   = $base_url . "/$1";
        $feed_items->{$1}{title} = sanitise_html($2);
        $feed_items->{$1}{desc}  = sanitise_html($3);
    }
}
# Generate the RSS file via a heredoc template
my $template_body;
my $build_date = DateTime->now( time_zone => 'GMT' ) # For privacy
->strftime('%a, %d %b %Y %H:%M:%S %z');
my $count = 0;
# First the body, so it can be added all at once...
for my $feed (
    reverse
    sort { $feed_items->{$a}{epoch_date} <=> $feed_items->{$b}{epoch_date} }
    keys %{$feed_items}
    )
{
    # Stop processing early if there are too many items
    $count++;
    if ( $count > $MAX_FEED_ITEMS ) {
        last;
    }
    $template_body .= <<"EOF";
<item>
<title>$feed_items->{$feed}{title}</title>
<description>$feed_items->{$feed}{desc}</description>
<pubDate>$feed_items->{$feed}{pub_date}</pubDate>
<link>$feed_items->{$feed}{url}</link>
<guid>$feed_items->{$feed}{url}</guid>
</item>
EOF
}
# ...and then the rest
my $template = <<"EOF";
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>What's new on ShadowM00n's Summit</title>
<description>A basic RSS feed for those so inclined</description>
<link>$base_url</link>
<lastBuildDate>$build_date</lastBuildDate>
<atom:link href="$base_url/rss.xml" rel="self" type="application/rss+xml" />
$template_body
</channel>
</rss>
EOF
# Write the template out
open $fh, '>', $output or die "Unable to open $output: $ERRNO\n";
print {$fh} $template;
close $fh;
print "$output written\n";
The flow of this isn’t too difficult to understand, really. I read in index.html and scoop up everything in the <body>. This opening invocation is a little odd because I’m essentially trying to emulate the range feature of sed that I used above when scooping out the TOC. I then search the body for all links that don’t start with http(s) – that is, just my internal links. This lets me avoid, for instance, the support button that is currently on the front page linking to the open letter in support of Richard Stallman.
I then run through each page and extract or generate the data I need: the publication date for the RSS feed, which in this case is determined by the “Last updated” line in each file (converted to the form mandated by RFC 2822), a UNIX epoch time that lets me easily sort the articles by date, a URL, a title, and a description. When I gather the last three, note that I’m getting them from the regular expression that pulled out the URLs to begin with – they come in left to right, one from each set of capturing parentheses (the pair beginning with “?!” is a lookahead and does not capture). The title and description are also run through the sanitiser at this time, in order to replace characters that could break the feed.
I then generate a build date, and then put it all together to create the actual XML file via two heredoc templates. I begin by creating the body and storing it as a scalar. The body is made by appending data to this scalar, with each article’s information being supplied in reverse chronological order (newest first), cutting off once we reach either 20 articles or the end of the list, whichever comes first. At this point, the second template is supplied, and the first interpolated into it in the appropriate place. Lastly, the completed file is spit out, and a notification written to STDOUT.
This is quite a lot to take in at once, and it was quite a bit to write up-front, but I hope you can see the benefits that a system like this can offer to your own workflow. My system probably won’t be a good, or at least immediate, fit to your own site because mine is made specifically for me, but the general principles and ideas can still be useful. Experiment!