Pollen Source: posts/flattening-to-html.poly.pm

#lang pollen

◊(define-meta title "Flattening a Site: From Database to Static Files")
◊(define-meta published "2016-11-11")

I just finished converting ◊link["https://howellcreekradio.com"]{a site} from running on a database-driven CMS (Textpattern in this case) to a bunch of static HTML files. No, I don’t mean I switched to a static site generator like Jekyll or Octopress; I mean it’s just plain HTML files and nothing else. I call this “flattening” a site.◊margin-note{I wanted a way to refer to this process that would distinguish it from “archiving”, which to me also connotes taking the site offline. I passed on “embalming” and “mummifying” for similar reasons.}

In this form, a web site can run for decades with almost no maintenance or cost. It will be very tedious if you ever want to change it, but that is fine because the whole point is long-term preservation. It’s a considerate, responsible thing to do with a website when you’re pretty much done updating it forever. Keeping the site online prevents link rot, and you never know what use someone will make of it.

◊section[#:id "how-to-flatpack"]{How to Flatpack}

Before getting rid of your site’s CMS and its database, make use of it to simplify the site as much as possible. It’s going to be incredibly tedious to fix or change anything later on so now’s the time to do it. In particular you want to edit any templates that affect the content of multiple pages:

◊ul{
◊item{Strip out all external dependencies, such as Typekit and any script-based analytics. In place of Typekit I used a self-hosted font for the headings and just switched to Georgia for the body text.}
◊item{Remove unneeded sections and internal links from the page templates. For example, on my site I eliminated any reference to the privacy policy and the “different ways to subscribe” guide. I also made the “episodes” page one giant list of all the episodes instead of being broken up into 20 pages.}
◊item{Finally, wherever appropriate, edit text in the page templates to make the site’s “archival status” clear. Remove anything that could give the impression that this place is still a going concern.}
}

Next, on your web server, make a temp directory (outside the site’s own directory) and download static copies of all the site’s pages into it with the ◊code{wget} command:

◊blockcode{wget --recursive --domains howellcreekradio.com --html-extension howellcreekradio.com/}

This will download every page on the site and every file linked to on those pages. In my case it included images and MP3 files which I didn’t need. I deleted those until I had only the ◊code{.html} files left.
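If you’d rather not prune by hand, something like the following could do the same job. This is only a sketch: ◊code{keep_only_html} is a hypothetical helper name, the example directory is an assumption, and it relies on the ◊code{-delete} and ◊code{-empty} extensions that GNU and BSD ◊code{find} both provide. It deletes files permanently, so point it at the temp copy only.

◊blockcode{# Hypothetical helper: delete everything under DIR that is not an .html file.
keep_only_html() {
  dir="$1"
  # Remove every regular file whose name does not end in .html
  find "$dir" -type f ! -name '*.html' -delete
  # Remove directories left empty (-delete works depth-first)
  find "$dir" -mindepth 1 -type d -empty -delete
}

# Example, assuming wget saved into ./howellcreekradio.com:
# keep_only_html ./howellcreekradio.com}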

◊subsection[#:id "fixing-some-links"]{Digression: Mass-editing links and filenames from the command line}

This bit is pretty specific to my own situation but perhaps some will find it instructive. At this point I was almost done, but there was a bit of updating to do that couldn’t be done from within my CMS. My home page on this site had “Older” and “Newer” links at the bottom in order to browse through the episodes, and I wanted to keep it this way. These older/newer links were generated by the CMS as URLs with a query string: ◊code{http://site.com/?pg=2} and so on. When ◊code{wget} downloads these links (and when the ◊code{--html-extension} option is invoked), it saves them as files of the form ◊code{index.html?pg=2.html}. These all needed to be renamed, and the pagination links that refer to them needed to be updated.

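I happen to use ZSH, which comes with an alternative to the standard ◊code{mv} command called ◊code{zmv} that recognizes patterns. It has to be loaded with ◊code{autoload} before first use, and adding ◊code{-n} makes it print the renames without performing them:

◊blockcode{# Load zmv (it ships with zsh but is not loaded by default)
autoload -U zmv
# Single-digit pages, then two-digit pages
zmv 'index.html\?pg=([0-9]).html' 'page$1.html'
zmv 'index.html\?pg=([0-9][0-9]).html' 'page$1.html'}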

So now these files all had names like ◊code{page2.html} and ◊code{page20.html}, but they still ◊emph{contained} links in the old ◊code{?pg=} format. I was able to update these in one fell swoop with a one-liner:

◊blockcode{grep -rl \?pg= . | xargs sed -i -E 's/\?pg=([0-9]+)/page\1.html/g'}

To dissect this a bit:

◊ul{
◊item{◊code{grep -rl \?pg= .} lists all files containing the links I want to change. I pass this list to the next command with the pipe ◊code{|} character.}
◊item{The ◊code{xargs} command takes the list of filenames produced by ◊code{grep} and passes them as arguments to the ◊code{sed} command.}
◊item{The ◊code{sed} command has the ◊code{-i} option to edit the files in-place (BSD and macOS versions of ◊code{sed} need it written as ◊code{-i ''}), and the ◊code{-E} option to enable extended regular expressions, so that ◊code{( )} and ◊code{+} work without backslashes. For every file in its list, it uses ◊code{s/\?pg=([0-9]+)/page\1.html/g} as a regex-style search-and-replace pattern. You can learn more about ◊link["https://regex101.com/r/rXuSJB/1"]{the details of this search pattern} if you are new to regular expressions.}
}
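Since ◊code{sed -i} overwrites files with no undo, it can help to try the substitution on a sample line first. A minimal sketch, where ◊code{rewrite_pg_links} is a hypothetical wrapper name and the sample link is made up for illustration:

◊blockcode{# Hypothetical wrapper: apply the same substitution to standard input,
# printing the result instead of editing any file.
rewrite_pg_links() {
  sed -E 's/\?pg=([0-9]+)/page\1.html/g'
}

printf '%s\n' '<a href="http://site.com/?pg=2">Older</a>' | rewrite_pg_links
# prints: <a href="http://site.com/page2.html">Older</a>}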

OK, digression over.

◊subsection[#:id "back-up-the-cms-and-database"]{Back up the CMS and Database}

Before actually switching, it’s a good idea to freeze-dry a copy of the old site, so to speak, in case you ever need it again.

Export the database to a plain-text backup:

◊blockcode{mysqldump -u username -pPASSWORD db_name > dbbackup.sql}

Then save a gzip of that ◊code{.sql} file and the whole site directory before proceeding.
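That step might look something like the sketch below. The helper name ◊code{backup_site} and all the paths in the example are hypothetical; adjust them to wherever your site and dump actually live.

◊blockcode{# Hypothetical helper: backup_site SITE_DIR SQL_FILE OUT_TARBALL
backup_site() {
  site_dir="$1"; sql_file="$2"; out="$3"
  # Compress the dump in place (dbbackup.sql becomes dbbackup.sql.gz)
  gzip -9 "$sql_file"
  # Archive the site directory; -C keeps absolute paths out of the tarball
  tar -czf "$out" -C "$(dirname "$site_dir")" "$(basename "$site_dir")"
}

# Example with made-up paths; keep the tarball outside the site directory:
# backup_site /var/www/site dbbackup.sql ~/backups/site-backup.tar.gz}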

◊subsection[#:id "shutting-down-the-cms-and-swapping-in-the-static-files"]{Shutting down the CMS and swapping in the static files}

Final steps:

◊ol{
◊item{Move the HTML files you downloaded and modified above into the site’s public folder.}
◊item{Add redirects or rewrite rules for every page on your site. For example, if your server uses Apache, you would edit the site’s ◊code{.htaccess} file so that URLs on your site like ◊code{site.com/about/} would be internally ◊link["http://httpd.apache.org/docs/2.0/misc/rewriteguide.html"]{rewritten} as ◊code{site.com/about.html}. This is going to be different depending on what CMS was being used, but ◊strong{essentially you want to be sure that any URL that anyone might have used as a link to your site continues to work}.}
◊item{Delete all CMS-related files from your site’s public folder (you saved that backup, right?). In my case I deleted ◊code{index.php}, ◊code{css.php}, and the whole ◊code{textpattern/} directory.}
}
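As a rough illustration of the rewrite approach, here is a sketch of ◊code{.htaccess} rules that would serve ◊code{about.html} for a request to ◊code{/about/}. It assumes Apache with ◊code{mod_rewrite} enabled, the ◊code{.htaccess} file sitting in the document root, and each old URL mapping to a same-named ◊code{.html} file there; your CMS’s URL scheme may well need different patterns.

◊blockcode{RewriteEngine On
# Leave requests for real files and directories alone
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Only rewrite when a matching .html file actually exists
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
# /about/ or /about is served internally from /about.html
RewriteRule ^(.+?)/?$ $1.html [L]}

Because this is an internal rewrite and not a redirect, visitors keep seeing the old URL in the address bar.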

◊subsection[#:id "after"]{Once you’re done}

Watch your site’s logs for 404 errors for a couple of weeks to make sure you didn’t miss anything.
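One way to do that from the command line is sketched below. It assumes the server writes the common or combined log format, where the request path is the 7th whitespace-separated field and the status code is the 9th; ◊code{top_404s} and the log path in the example are hypothetical.

◊blockcode{# Hypothetical helper: list the 20 most frequently requested missing URLs.
top_404s() {
  # Print the path of every request answered with 404, then count duplicates
  awk '$9 == 404 { print $7 }' "$1" | sort | uniq -c | sort -rn | head -n 20
}

# Example with a made-up log location:
# top_404s /var/log/apache2/access.log}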

What to do now? You could leave your site running where it is. Or, long term, consider having it served from a place like ◊link["https://www.nearlyfreespeech.net"]{NearlyFreeSpeech} for pennies a month.