Aug 1, 2008

RegExs and Finite State Automata

In my regular browses through the Internet, I stumble across comments, flame wars, etc. Through doing this, I've realized two things: lots of computer-y people don't know about regular expressions (a highly useful tool) and a lot of them don't seem to know what computer science is (having taken comp sci in school, it slightly bothers me when people call something comp sci when it isn't actually comp sci). So I'm going to write about an actual comp sci topic in hopes that at some point I may help educate people. Also I'm writing this because I like automata theory and want to talk about it.

Regular expressions (regexps) are a part of computer science, in the sub-discipline of automata theory. You can probably put them into other sub-disciplines, but this is one where they are used a lot.
So what is a regexp? Many of us use them, or have heard of them. Well, let's compare to math. An arithmetic expression is a list of primitives (like numbers or variables) and possibly operators (possibly, since "5" is technically an expression) including + and *. They are given an input (usually a set of values that correspond to any variables used in the expression) and evaluate to a certain value.
A regexp is similar to an arithmetic expression. The difference is that a regexp doesn't really evaluate to a value as an arithmetic expression does, rather the regexp accepts or rejects the input (you could say it evaluates to true or false).
So what does a regexp consist of? There is an alphabet Σ (alphabets are usually referred to by the Greek letter Σ, pronounced "sig-ma") over which the regexp works. This can be as simple as {a, b}, or more complicated, like Unicode. For simplicity, we will use small alphabets like {a, b}.
Then there are operations. There are three operations with regexps:

Concatenation: /ab/ (I will use / / to represent a regexp). The input is accepted if it is an a, followed by a b.
Alternation: /a|b/ The input is accepted if it is an a or a b.
Kleene Star: /a*/ The input is accepted if it is zero or more as.

We also need to introduce ε (Greek letter epsilon) which means "nothing".

Those familiar with Perl may say, "Hey, there are lots more operators than that!" True. There are lots more than that1. However, these are the fundamental base operators. Like in arithmetic, there are the fundamental operations of * and + from which all other operations can be created2 (like subtraction, exponentiation). All other operations can be made from combinations of these three, and none of these three can be made up from combinations of the other two (I'm pretty darn certain - you can say the Kleene Star is a combination of concatenation and alternation, but regexps have to be finite and Kleene Star is the only way to represent infinite inputs using a finite regexp).
Let's take a look at some others:
+ (one or more): /a+/ is the same as /aa*/
? (zero or one): /a?/ is the same as /(ε|a)/
and you can figure out others.

So these are pretty cool, but we still only have them at a theoretical level. How would we put these into practice? First, we need a bit more knowledge.

Now we move onto the finite state automaton or FSA (also known as a finite state machine, I just think "automaton" sounds so much cooler). These are "machines" that work over an input, and either accept or reject it.
What are these made up of? Well, we still have our alphabet Σ. We also have a finite set of states, Q. Then a start state s which is in Q, and a set of final states F which is a subset of Q. Finally, there is a transition function δ (Greek letter delta - us comp sci kids seem to love the Greeks) which is a function that takes two inputs: a state and a character, and outputs a state3.
The FSA works by starting in the first state s, and then going in steps over the input, consuming one as it goes, and doing a state transition for each one. The FSA moves to a different state depending on what δ outputs. Then once the input is all consumed, it is rejected if the state we are in is not in the set of final states F, otherwise it is accepted.
It's really easy to write a program to simulate a FSA. Here is a simple algorithm:
function doFSA(input, s, d, F)
set q = s

while input is not empty
a = pop first element from input
set q = d(a, q) // this is the state transition

if q is in F
accept
else
reject
Pretty simple eh?

So, how does this relate to regexps? Well, in automata theory there is a proof (that I don't feel like writing out here) that any regular expression can be converted to a finite automaton, and any finite automaton can be converted to a regular expression. That gives us a pretty easy way to simulate our regular expressions!

I've skipped a lot of details here. One that I want to talk about a little bit is non-deterministic FSAs.
A non-deterministic FSA allows two new things. The first is a ε-transition, where the FSA can move from one state to another without consuming an input. This means it can also move back. So you can look at this in different ways: the FSA is in both states; or the FSA's state is indeterminate, ie. you don't know which state it is in.
The second new thing is that FSA's allow for you to end up in different states for the same input from one state. If you're in q2, you could consume an a and end up in q3 or in q4 since the transition relation (in non-deterministic FSAs it is a relation rather than a function) can allow multiple outputs for a specific set of inputs.
It doesn't take much to realize this makes simulating a FSA much harder. Fortunately for us, there is an algorithm on how to convert from a non-deterministic FSA to a deterministic FSA.

The proof that any regexp can be expressed as a FSA involves showing that you can have an FSA that accepts a one-character regexp, and then showing that you can combine these FSAs into larger FSAs that represent the three basic operations of regexps. This creates a non-deterministic FSA, that then can be converted into a deterministic one.

So how is this useful? How does this relate to programming or Ubuntu? It doesn't really. But I think it's good to know how things work, or where our tools come from.

1 Perl regexps are not true regexps in that they let you use forward matching and that's not something that regexps can do. Second, if you know Perl regexps, I use an implicit ^ and $ in my regexps.
2 See here for some info on why multiplication is not just repeated addition.
3 This is the case for deterministic FSAs, non-deterministic FSAs use a transition relation Δ over Q x (Σ U ε) x Q (that's a Cartesian product).

2 comments:

Daniel said...

You do realize that Perl's regular expression, and, in fact, even Unix old, original regular expression libraries, are not truly regular expressions?

They are actually automata with stacks, because you can forward match based on matches you have done previously.

They also suck for speed when used this way, of course.

Rob Britton said...

Yes, thanks for reminding me. I was planning on saying that in the original article but I forgot while I was writing. I'll fix that now.