Writing whisker’s tokeniser using the Theory of Computation

whisker uses a Finite State Machine for tokenisation

TL;DR

In this article, I explain how the tokeniser for whisker works in simple, broad terms. I draw parallels to concepts in the Theory of Computation and explain how I used those techniques to make my job of implementation easier.

The corresponding source code is available on GitHub.

On whisker

whisker is my template engine for V that is inspired by the mustache template language. I’ve outlined the reasons I wanted to make whisker in the first place in my previous blog post. Here are some features that I think set it apart:

  1. whisker templates are logic-less. This means that there are no branching statements like if-else and also no for or while loops. Instead, you have Boolean positive and negative sections that mimic the branching behaviour, and iteration is achieved through lists.
  2. Only four data types are supported: Booleans, Strings, Maps, and Lists. Maps and lists can be nested. This helps keep the query engine simple enough to be fast, yet does not compromise on capability.
  3. Templates are composable: You can use iteration and partial template recursion to make rich, composable templates.
  4. Convenient data model: The input for the templates can be constructed in V source code, or imported and exported using JSON.
  5. Partial template support: You can isolate common snippets of templates and import them iteratively or recursively in the main input template. This allows for a simpler development experience and allows for varying levels of abstraction.
  6. Safe by default: Just like mustache, default tag interpolation escapes the contents provided. This protects against common HTML tag and script injection attacks and makes it easier to protect end-users.
  7. Customisable: whisker allows the delimiters to be changed from the default double-curly-braces. This avoids the need for character escaping when the template itself contains many curly braces, which could otherwise cause confusion.
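To illustrate the safe-by-default behaviour from point 6, here is a minimal sketch in Python (whisker itself is written in V; `interpolate` is a hypothetical helper for illustration, not whisker’s actual API):

```python
import html

def interpolate(value: str, raw: bool = False) -> str:
    """Render a tag's value, escaping HTML unless raw interpolation is requested."""
    return value if raw else html.escape(value)

# Default tags escape markup, defusing common injection attempts:
print(interpolate('<script>alert(1)</script>'))  # &lt;script&gt;alert(1)&lt;/script&gt;

# Raw tags pass the content through unchanged:
print(interpolate('<b>bold</b>', raw=True))      # <b>bold</b>
```

Escaping by default means a template author has to opt in to raw output explicitly, which is the safer direction for the common case of HTML rendering.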

Examples

We can take a look at an example to see the features in action.

{{=[ ]=}}
{
	"value": [&value]
}

Here we have a JSON (main) input template with the following features:

  1. The delimiter is swapped from {{...}} to [...] using the swap syntax in the first line.
  2. The rest of the template shows a regular JSON body with a key called "value", which is mapped to whatever the input data model provides for 'value'.
  3. We also added an ampersand (&) in front of the tag name to indicate that we don’t want the data to be HTML-escaped. This is also possible by using [{value}]. If the delimiters stay the same, triple braces indicate raw tag interpolation.

When we pass in data in the form of DataModel({'value': DataModel('42')}) (or a JSON equivalent), we get the following output:

{
	"value": 42
}

Note that the tag wasn’t quoted in the template, so it isn’t quoted in the output either.

More examples are shown in the project’s README file. The full specification is available in the spec directory which shows more examples of whisker in action.

The Theory of Computation

Now what does this template have to do with the theory of computation? What exactly is the ToC?

To put it simply, this is where you can apply a bit of mathematics to automate the vast majority of decisions you need to make while writing the part of your code that recognises a ‘language’.

Here, the whisker (and mustache) syntax is a language with a set of pre-defined rules that define what is normal text and what are tags and sections. Luckily for us, the rules are somewhat simple and that allows us to use a simple model to scan and understand a template. That simple model is a Finite-State machine and there is a relatively straightforward correlation between the mathematical representation and concrete code implementation.

In an FSM, there are several discrete ‘states’ that a ‘machine’ can be in. In our case, the parser is the machine and the states correlate to the position when we are scanning regular text, when we might be detecting an opening delimiter, when we are reading tag contents, etc.

These states are interconnected by input conditions. Suppose we’re in the normal text state and the next input character is the first character of the opening delimiter (which is '{' if the delimiters haven’t been swapped yet). Then we move from the normal text state to the scanning for potential opening delimiter state.

We stay here until we have either completely scanned the opening delimiter (which may vary in length) or until we get an unexpected input character, marking a false alarm, at which point we revert back to the normal state.
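As a rough sketch of these two states (written in Python rather than V, and greatly simplified compared to the real tokeniser; the state names are illustrative):

```python
from enum import Enum, auto

class State(Enum):
    NORMAL = auto()      # scanning ordinary text
    MAYBE_OPEN = auto()  # saw part of the opening delimiter
    TAG = auto()         # inside a tag, reading its contents

def scan(template: str, open_delim: str = '{{') -> list[tuple[str, str]]:
    """Toy FSM: split a template into ('text', ...) pieces, detecting tag openings."""
    state, buffer, matched, tokens = State.NORMAL, '', 0, []
    for ch in template:
        if state == State.NORMAL:
            if ch == open_delim[0]:
                state, matched = State.MAYBE_OPEN, 1
            else:
                buffer += ch
        elif state == State.MAYBE_OPEN:
            if ch == open_delim[matched]:
                matched += 1
                if matched == len(open_delim):  # full delimiter scanned
                    tokens.append(('text', buffer))
                    buffer, state = '', State.TAG
            else:  # false alarm: emit the characters we were holding back
                buffer += open_delim[:matched] + ch
                state = State.NORMAL
        elif state == State.TAG:
            ...  # tag-content and closing-delimiter states omitted for brevity
    if state == State.MAYBE_OPEN:  # input ended mid-delimiter: flush held-back chars
        buffer += open_delim[:matched]
    if buffer:
        tokens.append(('text', buffer))
    return tokens
```

Notice how the false-alarm branch has to re-emit the delimiter characters it had provisionally consumed, which is exactly why the real implementation keeps temporary buffers around.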

We can make a few augmentations (like using memory to keep track of new delimiters) which make our ‘machine’ slightly more capable than a regular FSM. But we still retain the spirit of the original and use that to build the prototype quicker.

The Algorithm

We can start implementing our modified FSM in V. The source code is on GitHub and we focus on the extract_tokens function.

Briefly, this is how the control flow works:

  1. We go through the input string character by character. We don’t perform look-ahead because we don’t need to. Temporary string buffers hold characters until we know which token they belong to.
  2. Input characters are collected normally until we hit the first character of the tag opening delimiter. All data collected until this point is put into a regular token, and the internal state makes the machine ready to accept the rest of the opening delimiter’s characters. If the subsequent characters don’t complete the delimiter, it is treated as a false alarm and the state goes back to normal.
  3. In case we do find the opening delimiter successfully, we start scanning the tag contents. Whitespace is ignored until we find either a special character or just the beginning of the tag name.
  4. The delimiter swap directive is handled within the loop with appropriate bounds checking and whitespace handling. It took a couple of tries to dial it in properly.
  5. The different types of tokens are indicated by prefix characters like + for positive Boolean section, - for negative Boolean section, # for maps, * for lists, etc. Mostly straightforward but the raw-tags need some care.
  6. We finish off the tag recognition by scanning the closing delimiter and go back to recognising normal unaffected text.
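The sigil dispatch in step 5 could be sketched like this (Python, with illustrative kind names that don’t necessarily match whisker’s internal token types):

```python
# Hypothetical mapping from a tag's prefix character to a token kind,
# mirroring step 5 above; only the sigils mentioned in the text are shown.
SIGILS = {
    '+': 'positive_section',
    '-': 'negative_section',
    '#': 'map_section',
    '*': 'list_section',
    '&': 'raw_tag',
}

def classify(tag: str) -> tuple[str, str]:
    """Split a tag's contents into (kind, name) based on its prefix sigil."""
    tag = tag.strip()
    kind = SIGILS.get(tag[0], 'normal_tag')
    name = tag[1:].strip() if tag[0] in SIGILS else tag
    return kind, name

print(classify('+admin'))  # ('positive_section', 'admin')
print(classify('value'))   # ('normal_tag', 'value')
```

The raw-tag case needs extra care in practice because the triple-brace form changes the closing delimiter too, which a simple prefix table like this doesn’t capture.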

During the implementation, I had to experiment a little bit to eventually arrive at the current version and it seems fairly robust. If you find any edge case, please file an issue or contact me!

We’re Not Done

Unfortunately, this is not the only pass we need. Before extracting the tokens, we make sure that the indentation in the template is consistent. Tabs or spaces should be used exclusively, not as a mixture. This makes sure that templates are properly formatted.
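A minimal sketch of such a consistency check, assuming the rule applies across the whole template (Python for illustration, not whisker’s actual V code):

```python
def consistent_indentation(template: str) -> bool:
    """Check that indentation uses tabs or spaces exclusively, never a mixture."""
    seen = None  # first indentation character encountered anywhere, if any
    for line in template.splitlines():
        for ch in line:
            if ch not in (' ', '\t'):
                break          # past the indentation; rest of the line doesn't matter
            if seen is None:
                seen = ch      # remember which style this template uses
            elif ch != seen:
                return False   # mixed tabs and spaces
    return True

print(consistent_indentation('a\n\tb\n\tc'))  # True
print(consistent_indentation('\ta\n  b'))     # False
```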

After we have extracted the tokens, we need to handle cases when the tags are standalone (one line only has indentation and a tag) or inline (line has normal content and one or more tags). We could have dealt with this in the extract_tokens function, but I didn’t feel like complicating the process further. This post-processing step seemed a lot more intuitive.
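The standalone check in this post-processing step might look like the following (Python; `tag_span` is a hypothetical representation of where the tag sits within its line):

```python
def is_standalone(line: str, tag_span: tuple[int, int]) -> bool:
    """A tag is standalone if its line contains only indentation and the tag itself."""
    start, end = tag_span  # hypothetical: character indices of the tag within the line
    before, after = line[:start], line[end:]
    return before.strip() == '' and after.strip() == ''

# '  {{#items}}' holds nothing but indentation and one tag -> standalone
print(is_standalone('  {{#items}}', (2, 12)))      # True

# 'Total: {{count}}' mixes normal text with a tag -> inline
print(is_standalone('Total: {{count}}', (7, 16)))  # False
```

Distinguishing the two cases matters because standalone section tags typically should not leave a blank line behind in the rendered output, while inline tags must preserve the surrounding text exactly.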

In retrospect, I could have used a pushdown automaton (PDA) to handle indentations and newlines more elegantly. But I prioritised finishing the project over obsessing over the details, and I really wanted to start tackling template program execution.

Wrapping Up

The tokeniser is implemented in tokenizer.v (pardon the American spelling). I spent several weeks in December 2022 fiddling with the code but I’m happy with how it turned out. I hope that this version withstands scrutiny until we find some critical fault which mandates a rewrite of the tokeniser.

Until then, this is how I put my CS lessons and math background to use in practical projects. Thank you for reading this far and I hope you enjoy using whisker! You can read the announcement post as well.

In my next blog post, I’ll write about how I used special Linked Lists to represent and execute whisker’s template programs.
