Monday, April 27, 2009

Tricky state machines

You have a programmable logic design, and you have tested it on the bench. Everything works. You have run functional simulation backwards and forwards. Then you re-spin the design to change or add something, and...something locks up. A part of the circuit no longer operates. Or a subblock that you did not touch starts giving sporadic malfunctions. Sometimes this malfunction is hard to recreate. It only shows up after a lot of testing. What is going on?

There is a good chance you are having issues with synchronization. You either have an asynchronous signal going into your synchronous logic, or a signal from one clock domain going into logic in another clock domain. Most designers are somewhat aware of these problems, but in a large, complex design it is easy to overlook things, or to take shortcuts and assume that things will work.

Finite state machines are very susceptible to this kind of problem. At each active clock edge, the inputs to the flip-flops that make up the state machine are sampled to determine the next state. The inputs must settle before the setup-hold window around this clock edge and stay stable through it. When you run the place-and-route tool, it will run timing analysis to ensure that this requirement is met, but normally it will do this only for signals in this clock domain.

It is easy to overlook an asynchronous input, or a signal from another clock domain, which is essentially asynchronous to the clock that runs the state machine. There is no guaranteed timing relationship between this input and the setup-hold window of the clock, so the input can change state right inside that window. If the signal goes to the input logic of more than one flip-flop, the data path delay is likely to be slightly different for each flip-flop, and the new input state will be latched by some flip-flops on the current clock edge but not by others. Depending on how the state machine is coded, you could throw it either into an unexpected but valid state, or into an undefined state (especially if it is one-hot encoded). This will cause the state machine to malfunction or hang.
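The failure described above can be sketched with a small behavioral model. This is only an illustration in Python, not HDL and not taken from any real design: the two-state one-hot machine (IDLE and RUN), the input name go, and the function names are all hypothetical. The model simply lets each flip-flop see either the old or the new value of the asynchronous input on the same clock edge, and ignores metastability.

```python
from itertools import product

# Hypothetical one-hot machine with two flip-flops: (idle, run).
# Legal states are IDLE = (1, 0) and RUN = (0, 1).
LEGAL = {(1, 0), (0, 1)}


def next_state(idle, run, go_seen_by_idle, go_seen_by_run):
    """Next-state logic where each flip-flop may sample a different value of 'go'."""
    idle_next = idle and not go_seen_by_idle      # leave IDLE when go is seen
    run_next = run or (idle and go_seen_by_run)   # enter RUN when go is seen
    return (int(idle_next), int(run_next))


def outcomes_from_idle():
    """Enumerate every old/new combination the two flip-flops could latch."""
    results = {}
    for seen in product((0, 1), repeat=2):
        results[seen] = next_state(1, 0, *seen)
    return results


if __name__ == "__main__":
    for seen, state in outcomes_from_idle().items():
        verdict = "legal" if state in LEGAL else "ILLEGAL"
        print(f"go seen as {seen} -> state {state} ({verdict})")
```

When both flip-flops agree on the input, the machine either stays in IDLE or moves to RUN. When they disagree, the model lands in (0, 0) or (1, 1), neither of which is a legal one-hot state. With this particular next-state logic, (0, 0) maps back to (0, 0) on every later clock, which is the hang.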

By the way, a counter is a state machine. It is tempting to take an asynchronous signal and use it as a "synchronous" clear that only has to reset the counter to zero. The thought is that the clear will be active for a few clocks, and even if the signal initially only gets to some of the registers, it will get to all of them on the next clock. However, this clear is going to be released at some point, and if it de-asserts close to a clock edge, you can get unexpected results, because the clear signal is combined with counting logic that is trying to increment the counter when not in reset.
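As a rough illustration of the clear "only getting to some of the registers", the sketch below enumerates what a counter could hold after a clock edge on which the clear is changing. It is a bit-level Python model under simple assumptions: a plain binary up-counter, each bit independently seeing the clear as either active or inactive, and no metastability. The function name possible_values and the 4-bit width are hypothetical.

```python
from itertools import product


def possible_values(count, width=4):
    """Values the counter may hold after an edge where 'clear' is changing.

    Each bit either sees the clear (and loads 0) or does not
    (and loads its bit of count + 1).
    """
    mask = (1 << width) - 1
    incremented = (count + 1) & mask
    values = set()
    for sees_clear in product((False, True), repeat=width):
        value = 0
        for bit, cleared in enumerate(sees_clear):
            if not cleared:
                value |= incremented & (1 << bit)
        values.add(value)
    return values


if __name__ == "__main__":
    print(sorted(possible_values(5)))
```

For a counter at 5, the intended results are 0 (cleared) or 6 (incremented), but the model also allows 2 and 4. Real hardware can behave worse than this, since the model leaves out metastability and the exact gate-level structure of the counting logic.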

Even a single flip-flop can show unexpected results. If the input consists of logic that can either set or reset the flip-flop, and the "set" logic is synchronized properly, but the "reset" logic is asynchronous, you can still have problems. I have seen that even when the "set" logic remains inactive, and the asynchronous "reset" logic activates, the flip-flop can be set. From an RTL point of view, this makes no sense. But at the gate level, something else happens that produces the opposite of the expected result. The only safe option is to synchronize all signals into the correct clock domain.
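One common way to bring a single-bit signal into a clock domain is a two-flip-flop synchronizer. The post does not name a specific technique, so treat this as an assumption about what "synchronize" could look like. Below is a minimal behavioral sketch in Python; the class name TwoFlopSynchronizer is hypothetical, and the model cannot show metastability. It only shows that downstream logic reads one registered value, delayed by two clocks, instead of the raw input.

```python
class TwoFlopSynchronizer:
    """Behavioral model of two back-to-back flip-flops in the destination clock domain."""

    def __init__(self):
        self.stage1 = 0  # first flip-flop, samples the raw asynchronous input
        self.stage2 = 0  # second flip-flop, the only output other logic should use

    def clock(self, async_in):
        """Advance one destination clock edge and return the synchronized output."""
        # Both flip-flops update on the same edge: stage2 takes the old stage1.
        self.stage2, self.stage1 = self.stage1, async_in
        return self.stage2


if __name__ == "__main__":
    sync = TwoFlopSynchronizer()
    raw = [0, 1, 1, 1, 0, 0]
    print([sync.clock(bit) for bit in raw])  # [0, 0, 1, 1, 1, 0]
```

Every flip-flop in the state machine then takes its input from the second stage, so they all see the same value on a given clock edge and the normal timing analysis for that clock domain covers the path.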