Sunday, February 12, 2017

Knuth Morris Pratt table construction.

KMP algorithm for pattern matching relies on the partial match table. Construction of partial match table is the bulk of work involved in KMP algorithm and could be very daunting to understand. Below, I have tried to document my understanding of the partial match table construction. 

One of the easier ways to construct a match table for a pattern is to visualize the pattern as a finite state machine(FSM). A finite machine represents a mathematical model of computation. The machine can be in any one of the possible states at any point of time. The machine has a predefined states and based on the input the machines ends up in one of the states. If and when the machine reaches the end state, we have found the pattern we are looking for. FSM visualization should be intuitive from the below discussion and no prior knowledge is required.

Lets construct an FSM for the pattern ABABAC. A FSM has a start state (0 in this case) and an end state (6 in this case). Each character in the pattern takes the machine from one state to the next. 



When the FSM is in state 0, an input of A takes it to state 1. In state 1, an input of B takes the FSM to state 2 and so on. 
Now, at each state, we have to account for the possibility of a non-matching input character. A non-matching input character will still leave the machine in some state.



How did we account for the non-matching input characters at each state ? The idea behind using a FSM for pattern matching is that we don't always need to back up to the start of the pattern if there are some characters of pattern that have matched with the input. So we back up to the point where we have a match and start comparison again. When we are in state#0 and the next character is a B or C, we stay in the same state as highlighted in red above. When we are in state 1 and the next input character is a A, the stream of characters processed is AA. But we were expecting AB to move to state 2. Since there's a mismatch, we have to ignore the first character of the string AA and process the remaining characters hoping to find a match with the subsequent characters. So we are left with a A and this input to our FSM again leaves us in state 1 as highlighted in green above. If the input were AC, again there's a mismatch and we ignore the first char and we are left with C. If C is provided as an input to state 0, the FSM remains in state 0. So at every state, if there's a mismatch due to next character of input, we remove the first character of the input,  provide the rest of the string as an input to the FSM and start from state 0.

Lets apply this procedure to a couple more states. Refer to the red lines above. At state 2, we have already processed the string AB. Another 'A' would take the FSM to state 3. 'B' would lead to a mismatch and we have to ignore the first character of the input processed until now. So our input stream is effectively ABB - A = BB. Now process BB using the FSM machine. Input of BB would leave the FSM in state 0 and hence the path(broken red line) from state 2 to state 0 with a B as the label. Same is the case with an input of C to state 2.(Refer to the green lines above) The stream of characters would be ABC.On removing the first character and processing the string BC through the FSM, would lead the machine to state 0. Hence the path(broken green line) from state 2 to state 0.

At state 5(figure above), we have already processed the characters, ABABA. 'C' would take the FSM to end state. 'A' would mean an input stream of ABABAA. Since there's a mismatch, remove the first char and we are left with BABAA. When this is processed, the FSM ends up in state 1 as highlighted in red below. So state 5 on a A leads back to state 1 as highlighted by blue broken arrow below.

Similarly for state 5, an input of B would mean a stream of ABABAB. Since the input is not a 'C' we need to figure out where to go next. Discard the first character of the stream and we have BABAB. Process this string through our FSM and we end up in state 4 as highlighted in red below. So state 5 on B leads back to state 1 as highlighted by blue broken arrow below.
Below is the final FSM.


No comments:

Post a Comment