Why Python over Lua over shell scripts to build my blog

The best way I found to completely understand how to use a static site generator was to build my own. I now understand at least one!

Published: 2023-10-22

It's very common for the first post in a technical blog to be about how the site was built. I'm not making an exception here; this is an explanation of my many attempts at making a static site generator (and why I ended up using Python).

A little history

Before I started building my own static site generator, but after I first got this domain name, I already had some Fossil repositories, so I used one of them as a home page for a while.

Fossil is nice and all, but since it's mainly a version control system, there are a bunch of pages in the web interface that are not necessary for a personal website (help pages, logins, etc.), and some features are missing (like sitemap.xml generation).

Even if Fossil's source code is relatively easy to extend, removing parts of the web interface was a big enough change that it made me consider "easier" ways instead.

I figured that my preferred way would involve building a static site from Markdown files and some kind of script.

The POSIX attempt

My initial attempt used a POSIX shell script that called Pandoc with some custom Lua filters.

At first, the script was simple enough, but when I got around to generating the blog's Atom feed and the page that lists the blog posts (which both share lots of information), I really felt like I needed arrays and/or hash tables, neither of which can be straightforwardly done in POSIX shell scripts (and no, a big string of many substrings separated by rare control characters doesn't count).
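To make the shared-information problem concrete, here is a minimal sketch in Python (the language the script eventually ended up in). It is not taken from the real build script: the `Post` fields, the `/blog/` URL layout and both function names are made up for illustration. The point is only that one list of records can feed both the listing page and the feed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    # Hypothetical fields; the real script may track different metadata.
    slug: str
    title: str
    published: date


def newest_first(posts: list[Post]) -> list[Post]:
    """Return the posts sorted by publication date, newest first."""
    return sorted(posts, key=lambda p: p.published, reverse=True)


def index_lines(posts: list[Post]) -> list[str]:
    """One line per post for the page that lists the blog posts."""
    return [
        f"{p.published.isoformat()} {p.title} (/blog/{p.slug}.html)"
        for p in newest_first(posts)
    ]


def feed_entries(posts: list[Post]) -> list[dict[str, str]]:
    """The same records, reshaped into the fields a feed entry needs."""
    return [
        {
            "title": p.title,
            "updated": p.published.isoformat(),
            "link": f"/blog/{p.slug}.html",
        }
        for p in newest_first(posts)
    ]


posts = [
    Post("hello", "Hello", date(2023, 1, 5)),
    Post("why-python", "Why Python", date(2023, 10, 22)),
]
```

In a POSIX shell script, `posts` would have to be flattened into a delimited string and re-parsed by each consumer.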

At that point, I probably should simply have assumed that I could use all of bash's features instead of limiting my script to POSIX compatibility. But no, I made a POSIX shell script, so that meant a lot of calls to sed and tr that could have been avoided with some bash variable expansion magic.

Anyway, I was ready to rewrite it all in another scripting language. I still wanted to use some kind of "small" programming language to minimize the build dependencies of my site.

Lua, simpler?

I was already using Lua filters with Pandoc, so it wasn't much of a leap to also use Lua for the rest of the build script.

So I rewrote the script in Lua, including a custom command-line argument parser (because Lua doesn't have something like getopts built in).

Apart from args parsing, the Lua script was much cleaner than its shell script counterpart. I could finally use lists and tables instead of putting everything in strings! And I could even include the build script as a module to access my helper functions in the Lua filters passed to Pandoc (which, strangely, did not add measurable overhead)!

Pandoc troubles

Pandoc is very good software and is extremely useful.

But Pandoc is also a bit slow to run (20 ms on my machine¹, compared to 2 ms for lowdown), and this compounds with the fact that one Pandoc call can only process a single file. This means that running Pandoc on only 5 files, one after the other, would take at least 100 ms! And this excludes the additional time it takes when Pandoc has to do syntax highlighting (around 300 ms per file)!

So I made some one-liner abomination to run Pandoc in parallel to lessen the effect of Pandoc's relatively slow run time.

I hope you like scrolling sideways, because this line is waaaaaay too long (and somewhat of an unmaintainable mess):

os.execute("find "..escape(dir).." -maxdepth 1 -iname "..escape(glob)..(exclude_pat and " '!' -iname "..escape(exclude_pat) or "").." -print0 | sed -z -E -e "..escape([[s,^(.*/)?(\.?[^./]*?)\.[^/]*$,\2,]]).." | xargs -0 -P 4 -I '{}'"..(verbose and " -t" or "").." pandoc"..(pandoc_args and " "..table.concat(escaped_pandoc_args, " ") or "").." "..escape("--output="..out.."/{}."..out_ext:gsub("^%.","")).." "..escape(dir.."/{}."..ext))

But it somehow felt necessary for performance.
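For comparison, here is a rough sketch of how the same job could look in Python without a shell pipeline. It is neither a faithful port of the one-liner above nor the code this site uses: `find_sources`, `pandoc_command` and `run_all` are hypothetical names, the file-name handling is simplified (`Path.stem` only drops the last extension), and the default of 4 workers just mirrors the `-P 4` above.

```python
import fnmatch
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Optional, Sequence


def find_sources(src_dir: Path, pattern: str,
                 exclude: Optional[str] = None) -> list[Path]:
    """List matching files directly inside src_dir (no recursion)."""
    files = sorted(p for p in src_dir.glob(pattern) if p.is_file())
    if exclude is not None:
        files = [p for p in files if not fnmatch.fnmatch(p.name, exclude)]
    return files


def pandoc_command(src: Path, out_dir: Path, out_ext: str,
                   extra_args: Sequence[str] = ()) -> list[str]:
    """Build the argument list for one Pandoc call.

    Passing a list to subprocess avoids the shell, so nothing needs escaping.
    """
    target = out_dir / (src.stem + "." + out_ext.removeprefix("."))
    return ["pandoc", *extra_args, f"--output={target}", str(src)]


def run_checked(command: Sequence[str]) -> None:
    """Run one command and raise if it exits with a non-zero status."""
    subprocess.run(command, check=True)


def run_all(commands: Sequence[Sequence[str]], jobs: int = 4,
            runner: Callable[[Sequence[str]], object] = run_checked) -> None:
    """Run the commands on up to `jobs` worker threads."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() consumes the results so that an exception raised by any
        # runner call is re-raised here instead of being silently dropped.
        list(pool.map(runner, commands))
```

Threads are enough here because each worker spends its time waiting on an external process. The `runner` parameter exists only so the scheduling can be exercised without Pandoc installed.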

Syntax highlighting

Pandoc's syntax highlighting is also not ideal; it's a bit slow and it does not support all the syntaxes I want to use.

While it can highlight the syntax of most programming languages, (at the time of writing) it cannot highlight shell sessions out-of-the-box²!

$ echo "Hello world"
Hello world

See? This works because I'm using Pygments.

The standalone pygmentize tool also has a slow startup speed, but at least it supports all the syntaxes I'm interested in using.

I've looked at other syntax highlighters before settling on Pygments. One of them was chroma, but its command-line tool had no way to safely get a language name without also possibly interpreting it as a file name³.

At one point I even considered not using syntax highlighting at all for my website to make the build go faster.

Pandoc Templates

A cool thing with Pandoc is that it has its own relatively simple syntax for templates⁴. It seemed ideal for a static site generator!

But then I realized that the default Pandoc templates, which I was using at the time as a base for my own custom templates, are not in the public domain; they are licensed under GPLv2+ and BSD3 (see https://github.com/jgm/pandoc-templates).

For some reason, I wanted to be able to use the license of my choosing on the source of my website. And I didn't want to have to include a copy of the BSD3 license of another project⁵ only for the templates.

That meant I had to start over (again) and build my own templates from scratch and avoid Pandoc and all of its niceties like Lua filters.

No Pandoc?

But I still wanted to run filters on the HTMLized Markdown of my site!

Unfortunately (and predictably), Lua can't parse HTML without either using a library for XML (like luaexpat) or parsing it on my own.

Since I no longer wanted to use Pandoc for my site, Lua stopped having the advantage of already being a dependency, so this led me to search for a programming language with a bigger standard library.

Python!

I never really used Python seriously like this before.

It's kind of enlightening that something I thought was so awful works so well.

Slow startup

What makes pygmentize start up slowly is Python's own startup speed (23 ms minimum on my machine). For comparison, both Lua and Perl start in 3 ms on the same machine¹.

$ time python3 -c 'print("Hello world")'
Hello world

real    0m0.023s
user    0m0.018s
sys     0m0.004s

$ time lua -e 'print("Hello world")'
Hello world

real    0m0.003s
user    0m0.001s
sys     0m0.002s

$ time perl -e 'print("Hello world\n")'
Hello world

real    0m0.003s
user    0m0.001s
sys     0m0.002s

And that's even after a few previous runs.

A solution is to use Pygments directly from a Python script. That way, the Python interpreter is only started once, even if thousands of snippets need highlighting.
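As a minimal sketch of that idea (not the site's actual code): `highlight_snippet` is a hypothetical helper, `"console"` is the Pygments alias for shell sessions, and the `ImportError` fallback to plain escaped text is my own addition so the example still runs where Pygments is not installed.

```python
import html

try:
    from pygments import highlight
    from pygments.formatters import HtmlFormatter
    from pygments.lexers import get_lexer_by_name
    FORMATTER = HtmlFormatter()  # created once, reused for every snippet
except ImportError:
    # Assumption for this sketch only: degrade to unhighlighted output.
    highlight = None


def highlight_snippet(code: str, language: str) -> str:
    """Return an HTML fragment for one code snippet."""
    if highlight is None:
        return f"<pre>{html.escape(code)}</pre>"
    lexer = get_lexer_by_name(language)
    return highlight(code, lexer, FORMATTER)


# Every call below happens inside the same interpreter, so the startup
# cost is paid once no matter how many snippets there are.
snippets = [
    ('$ echo "Hello world"\nHello world\n', "console"),
    ('print("Hello world")\n', "python"),
]
fragments = [highlight_snippet(code, language) for code, language in snippets]
```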

Perhaps paradoxically, Python's slow startup speed can be an incentive to use Python, since anything else calling Python libraries is slower (unless you're using C or C++ and want to embed the Python interpreter in your program).

Remember, I want to use Pygments because it seems well-maintained and is one of the best server-side syntax highlighters I found that can highlight shell sessions.

Args parsing

I was very pleasantly surprised to find out about the argparse module in Python's standard library.

It even generates the --help output from the descriptions of each sub-command and flag!

It's much more concise than manually writing and structuring the --help output in shell scripts, and much easier than manually parsing the command-line arguments like I did in Lua.
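Here is a small sketch of what that looks like. The program name, the `build` and `clean` sub-commands and the `--out` flag are invented for the example and are not necessarily what my script accepts.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Build a parser with two hypothetical sub-commands."""
    parser = argparse.ArgumentParser(
        prog="build.py", description="Build the site."
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true", help="print each step"
    )
    # required=True makes argparse report an error when no sub-command
    # is given, instead of leaving args.command set to None.
    commands = parser.add_subparsers(dest="command", required=True)

    build = commands.add_parser("build", help="render every page")
    build.add_argument("--out", default="public", help="output directory")

    clean = commands.add_parser("clean", help="delete the output directory")
    clean.add_argument("--out", default="public", help="output directory")
    return parser


args = make_parser().parse_args(["-v", "build", "--out", "dist"])
```

The `help=` strings are the only documentation written by hand; the usage line and the layout of `--help` come from argparse.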

Filters

Python's standard library natively supports XML⁶!

But HTML is not XML.

The solution I'm happy with for now is to build XHTML pages (which are valid XML) and then run what I'm calling "XML filters" on those pages. This makes it "easy" to make every internal link relative and to make them use the final location of their destination. It also makes it relatively simple to clean up what is included in the Atom feed of the blog.
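A minimal sketch of one such filter, using only the standard library. It assumes that internal links are written as site-absolute paths starting with `/`, which may not match how the real pages are authored, and it only covers the "make links relative" half; `relativize_links` is a hypothetical name.

```python
import posixpath
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def relativize_links(root: ET.Element, page_path: str) -> None:
    """Rewrite site-absolute hrefs so they are relative to page_path.

    page_path is the page's own location on the site,
    for example "/blog/why-python.html".
    """
    page_dir = posixpath.dirname(page_path) or "/"
    for link in root.iter(f"{{{XHTML_NS}}}a"):
        href = link.get("href")
        # Leave external ("https://...", "//host/...") and already
        # relative links untouched.
        if href and href.startswith("/") and not href.startswith("//"):
            link.set("href", posixpath.relpath(href, start=page_dir))


page = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<a href="/about.html">About</a>'
    '<a href="/blog/other.html">Other post</a>'
    '<a href="https://example.org/">Elsewhere</a>'
    "</body></html>"
)
relativize_links(page, "/blog/why-python.html")
```

`posixpath` is used instead of `os.path` so that URLs keep forward slashes on every platform. This sketch does not treat directory-style links with a trailing slash specially.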

For example, my posts include syntax highlighting for code snippets, but the spans and classes inserted by Pygments are useless in a feed since styling is mostly ignored by feed readers. So I made filters to remove all class attributes and all span elements for the included XHTML in the Atom feed.
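The following is a sketch of how those two filters could be written with `xml.etree.ElementTree`, not a copy of my own. The function names are hypothetical. Since ElementTree elements do not know their parent, the span removal walks each parent's children and splices a span's text, children and tail back into place.

```python
import xml.etree.ElementTree as ET

SPAN = "{http://www.w3.org/1999/xhtml}span"


def _append_text(parent: ET.Element, index: int, text: "str | None") -> None:
    """Attach text just before position index among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def unwrap_spans(parent: ET.Element) -> None:
    """Replace every descendant span with its contents, in place.

    The element passed in is never removed, even if it is a span itself.
    """
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag != SPAN:
            unwrap_spans(child)
            i += 1
            continue
        inner = list(child)
        _append_text(parent, i, child.text)
        del parent[i]
        if inner:
            # The span's tail now follows its last former child.
            inner[-1].tail = (inner[-1].tail or "") + (child.tail or "")
            for offset, element in enumerate(inner):
                parent.insert(i + offset, element)
        else:
            _append_text(parent, i, child.tail)
        # i is not advanced: the element now at position i (a former
        # grandchild or the next sibling) still has to be examined.


def clean_for_feed(root: ET.Element) -> None:
    """Drop all class attributes, then unwrap all span elements."""
    for element in root.iter():
        element.attrib.pop("class", None)
    unwrap_spans(root)
```

After `clean_for_feed`, a highlighted `pre` block contains plain text only, which is all a feed reader is likely to show anyway.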

I'm not sure if there are other ways to do all of this, but I'm pretty happy with the versatility of XML.

Type checking

I vaguely knew that type annotations were supported by Python, so I used them to make some of the function definitions clearer.

But type annotations are ignored by the Python interpreter, so I looked for a static type checker. I found out about Pyright, and it's impressively useful once set up correctly.
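A tiny illustration of the difference, with a made-up helper (`output_path` is not from my script):

```python
from pathlib import Path


def output_path(source: Path, out_dir: Path, ext: str = ".html") -> Path:
    """Map a source file such as posts/hello.md to out_dir/hello.html."""
    return out_dir / source.with_suffix(ext).name


page = output_path(Path("posts/hello.md"), Path("public"))

# The call below passes str where Path is annotated. The interpreter
# ignores the annotations and would only fail once the function body
# runs (str has no with_suffix method); a static checker such as
# Pyright is expected to report the mismatch without running anything.
# output_path("posts/hello.md", "public")
```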

Static type checking in a scripting language is so nice, and it hugely helps with refactoring. I had heard that a lot, but I first really experienced it with this script: even after 2 rewrites, I still had many other ideas of things to change.

You're looking at the result

This is my site, and it's built from this Python script.

All the programming languages I considered for my build script are (interpreted) scripting languages. Why? Because it felt simpler to avoid making a build script for my build script.

The build script is evolving along with the site, so that means it changes as much or as little as I want, and only when I want it to change (unlike with static site generators like Hugo). And no need to avoid breaking changes, since I'm (or should be) the only user of my build script!

Hopefully, I'll be writing more blog posts instead of shaving the yak.

¹ For the record, my main machine is not that powerful compared to what's available nowadays. It's a low-power laptop whose processor is an Intel Core m3-8100Y.
² It's still possible to use custom syntax definitions to highlight shell sessions in Pandoc, see https://github.com/jgm/skylighting/issues/67
³ The only option of the CLI of chroma that can set the language to be highlighted (as desirable for snippets on a website) is --lexer (or -l). See the description of chroma's lexer flag.
⁴ See https://pandoc.org/MANUAL.html#template-syntax.
⁵ And that's even when assuming the dual licensing of pandoc-templates means either license can be used, but this is probably wrong since that project's README says it's under both licenses, so that would mean the safest way to use variations of the Pandoc templates would be to license everything (including my build script) under the GPLv2+ license, while also keeping the BSD3 license of pandoc-templates. Too many licenses to think about (and ways to make mistakes) in this case, so I'm glad I started from scratch to avoid being burdened by licenses I did not choose (even though I might end up using the GPL anyway).
⁶ I was very happy when I found out XML parsing and tree building are built into Python. https://docs.python.org/3/library/xml.html