A friend of mine recently covered deterministic and non-deterministic finite state machines (automata for the cool kids) in one of his classes and passed along a sample problem for me to figure out. Informally, build a dfa that will match a given string against another string in the provided alphabet. An example of this class of problem would be to build one that matches 'aab' in any string for the alphabet [ab] ('aaba' would match, 'aba' would not). To help with visualizing the answer for this example I've built a graph of the states and edges that make up our dfa (yay graphviz): graph At each state the machine follows the edge that matches the next input from a given string. If it ever hits the fourth state, it's found a match. Checking the string "baab" for "aab" (for which our machine is built), starting at state one it would take the path 1 -b-> 1 -a-> 2 -a-> 3 -b-> 4, and conclude that "aab" was in fact contained within "baab". After doing this on paper I thought it might be fun to use haskell and put together a function that took a dfa, a string to match, and produced a boolean result. The first step I took was to define a data type that represented the dfa.
type Dfa = [State]
data State =  State { identifier :: Int , edges :: [Edge] }
           | FinalState { identifier :: Int, edges :: [Edge] }
data Edge = Edge { action :: (Char -> Bool), edge_identifier :: Int }
This is an informal representation (you can find the formal one here) but here, a dfa is a set of States, a State is an identifier and a set of Edges, and an Edge is a function to determine if it is the edge to follow for a given input and its target state. If this looks a bit clunky at first a cleaned up version is further down. With my types in place I went ahead with building the functions to traverse a DFA for a given input:

match :: State -> Dfa -> String -> Bool
match current_state _ [] = is_final current_state
match current_state dfa (x:xs) = match next_state dfa xs
                           where next_state = traverse dfa current_state x

traverse :: Dfa -> State -> Char -> State
traverse dfa state = (find_state dfa) . (next_id $ edges state)

is_final (FinalState x y) = True
is_final (State x y) = False

find_state :: Dfa -> Int -> State
find_state (st:sts) state_id  | identifier st == state_id = st
                              | otherwise = find_state sts state_id

next_id :: [Edge] -> Char -> Int
next_id (e:es) input | (action e) input = edge_identifier e
                     | otherwise = next_id es input
The short explanation is that for each input it checks the current state for where it should go next, having determined the destination state's id it finds that state in the dfa and then proceeds with the rest of the input. Now, to test a sample graph:
state1 = State 1 [ Edge is_a 2, Edge is_b 1]
state2 = State 2 [ Edge is_a 3, Edge is_b 1]
state3 = State 3 [ Edge is_a 3, Edge is_b 4]
state4 = FinalState 4 [ Edge is_a 4, Edge is_b 4]

graph = [state1, state2, state3, state4]

is_a = (== 'a')

is_b = (== 'b') 
You'll note that state4 refers to itself no matter the input so that the match function can operate exhaustively on the input string. And it works! But the best part of this whole exercise (for me at least) was yet to be realized because this whole first definition has been done in a very inelegant manner by failing to make use of haskell's lazy nature and recursive data types.


The first, and easiest, thing to improve is how the Edges reference their destination states. Before, an integer was defined corresponding the identifier of the target state, but in haskell types can be defined recursively:
type Dfa = [State]
data State =  State { edges :: [Edge] }
           | FinalState
data Edge = Edge { condition :: (Char -> Bool), destination :: State }

For reference:
type Dfa = [State]
data State =  State { identifier :: Int , edges :: [Edge] }
           | FinalState { identifier :: Int, edges :: [Edge] }
data Edge = Edge { action :: (Char -> Bool), edge_identifier :: Int }
Not only is the new definition much cleaner, but its also more intuitive. An edge simply hands off the _actual_ state its pointing at. No need to go find it. As a result the code necessary to operate the state machine is also significantly simplified:

match :: State -> String -> Bool
match FinalState _ = True
match (State _) [] = False
match current_state (x:xs) = match next_state xs
    where next_state = traverse (edges current_state) x

traverse :: [Edge] -> Char -> State
traverse (e:es) input | (condition e) input = destination e
                      | otherwise = traverse es input
Notice the type signature of traverse is very intuitive:
traverse :: [Edge] -> Char -> State 
Given a list of edges and an input it will give you the resulting state. Not the id, which would then be used to search for its corresponding state. Finally, the actual graph:
state1 = State [ Edge is_a state2, Edge is_b state1]
state2 = State [ Edge is_a state3, Edge is_b state1]
state3 = State [ Edge is_a state3, Edge is_b state4]
state4 = FinalState

is_a = (== 'a')

is_b = (== 'b')


By leveraging the recursive reference to another state in the destination field of the Edge type the code shrinks by 3 functions and an enormous amount of complexity. In addition the ability to exhaust the input string or return early on a match with the FinalState constructor allows for a much more readable definition of the match function and the graph itself (no more self referencing weirdness). I had a lot of fun thinking this through and I would be excited to see some further refinements in the comments from those who know haskell!


08 Nov 2009