17 JUN 2011

Problem-solving with Automata-Based Programming

Recently when working with CSS, I wanted to automate some deployment-related tasks: I needed to combine a number of CSS files into a single file; fix image paths from production to deployment values; and strip out the contents from the resultant file. So I wrote a script that I now call each time I move from development to deployment, such when introducing modifications or adding pages to a site.

While working on the de-commenting part of the script (which was the meat of the task) I researched a style of programming called Automata-Based Programming. I liked this style so much that I'll discuss its characteristics and merits here.

I wanted to achieve the de-commenting in a single pass, producing an output CSS file without modifying my input file. (I also didn't want to resort to regular expressions immediately, wanting to develop a more fine-grained, character-based approach. My assumption was that a file should be processed as a stream rather than a large string).

At first, I pondered the naive, character-for-character procedural approach, something along the lines of, in pseudocode:

inside_comment = false
  while not end_of_file do
    one = getchar
    two = getchar  
    if not inside_comment and one =='/' and two =='*'
      inside_comment = true
    else
      if not inside_comment print one, two
    if inside_comment
      if one =='*' and two =='/'
        inside_comment = false
      else
        one = getchar
        two = getchar
        if one =='*' and two =='/'
          inside_comment = false

This was bound to build up to a crescendo of nested "if"s and "while"s. There is massive duplication with calls to getchar, as well as referencing and setting inside_comment. More importantly, this code builds on the assumption that the number of characters preceding the start of a comment is divisible by two. If this number is not even, the solution would fail and it would take additional conditional branching to make it work. All in all, a brittle, hardly readable solution open to not-so-subtle logical flaws.

Next, I considered using a string library to help out with pattern-matching. I had seen the StringScanner class be used for parsing JSON, so I considered going the same route and use StringScanner#scan to move through the string finding comments. However all that seemed to be was using a library to find occurrences of / and /. It was all about these two combinations of these two characters. No nested structures, no recursion, no well-formed language constructs. No grammar to validate against. This realization made relying on a library look rather redundant.

So I decided to go with the most raw approach - processing the CSS character-by-character. I visualized an auxiliary data structure as an intermediate destination for the characters on the way from the source file to the target file. This data structure would act like a valve, letting characters through when there was no comment and closing when there was. When the end of the comment was detected, the data structure would be cleared (the comment deleted) and the process would resume until end of file was reached.

In essence, this would be a queue with some helper methods for analyzing the top two characters. The Ruby Array class, with it's #shift method, provided a foundation for that data structure.

What this allowed me to do is to create a linear structure for my program with only one while loop (the outermost loop that reads in characters as long as there are more). This is one of the characteristics of Automata-Based Programming: only the outermost loop is needed and it is used to read in input from start to end.

With each step of the program (which adds a character to the data structure) I check for a presence of a comment based on what the top 2 characters on the queue are (the values I'm interested in are / and /). An instance variable @comment is then modified accordingly.

def process(char)
    add(char)
    @comment = true if comment_start
    @comment = false if comment_end
    c = filter
    c unless c.nil?
  end

Based on a subset of all possible values of these three factors (commentstart, commentend, and @comment), one of the following three steps is executed:

Dequeue the first character to the output (we have just entered a comment);
Clear the stack (we have reached the end of a comment);
Dequeue the first character to the output (no comment in sight).

Here, we have another characteristic of Automata-Based Programming: with each step of the program (here, it is adding a character to the queue) we query and / or modify the the program's "state". ("State" is the value of a set of variables at a particular step of the program). When the program continues, it encounters a set of directives. From this set, a directive will be executed depending on whether or not it satisfies a this state. This has the effect of "flattening out" the program's logical branches to several conditional statements that often all appear at the same level, almost assembly language-like:

def filter
    return get_first if ready and comment_start #["c", "/", "*"] - pop "c"
    clear if comment_end #["/", "*", "a", "b", "c", "*", "/"]
    get_first if ready and !comment #["a", "b", "c"] or ["b", "c", "/"]
  end

(Here is the complete program on GitHub).

I find that Automata-Based Programming is very useful, because it allows me to create a program around logical structures. It makes for programs that are easy to reason about, giving me solid logical anchors to what is happening at each step. This can be seen if we consider the following statements about our helper data structure:

A preconition for querying the state of the queue: The queue must contain at least two characters.
A precodition for removing the first character of the queue: The queue must contain at least three characters.
An invariant for the queue: When we are not inside a comment the queue is never longer than 3 characters.

(Specifically, as soon as I had the formulation of the above invariant and the visualization of the queue as a valve was I knew I had the solution to the problem; all that remained was to write the program).

Programs created in this style are easy to read, a pleasure to write, and are robust in structure, as they build on a formal logical model (A Finite State Machine, or FSM). ABP-style programs allow one to program for correctness and therefore, testability. Beyond clean code, there are other reasons to be proficient in this coding paradigm. Let's list a few of its applications:

Advanced text processing. The example in this article is very simple, but Finite State Machines are effective at processing text input of varying complexity and structure (markup, programming languages).

The logic behind Graphical User Interfaces. FSMs are very suited for modeling the interaction of a user with an interface as the components of such an interaction can be expressed as States, Events, and Transitions. For a modern example, consider Alex MacCaw's Super.js - a modular, jQuery-based JavaScript library for building RIAs.

Moving Object-Oriented programs towards increased explicitness, transparency, and stability. OO programming builds on the notion of objects that possess state at runtime. OO state is less rigorously defined than FSM state, and this has previously led some to criticize the OO paradigm. In large OO programs, state can become complex, elusive, and hardly verifiable. We can use the Automata-Based style of programming to create Object-Oriented systems that exhibit more explicit and predictable behavior.

Some resources on state-based computation and Automata-Based Programming:

The Wikipedia has a fantastic article with an example of how to transform a character - processing program from imperative style into ABP style: Automata-Based Programming.

Jon Bentley in Programming Pearls describes the evolution of a program to find the maximal-value subarray in an array of integers. The final solution shows how to combine scanning and state-saving to achieve a very short and fast program. Although it doesn't have explicit State classes or transitions, it works by querying and (re)setting state at each step; I like to think of it as the primordial ABP-style program.