Wednesday, November 17, 2010

Interesting reset problem

I happened across this description of a board that failed to boot up properly, just as we were dealing with a similar issue at work. You never know where debugging is going to lead you.

We are ramping up for production of a certain controller board, and have built many prototypes and pre-production boards. Suddenly the factory reported a batch of boards that sometimes get stuck in reset, with different frequencies of occurrence depending on seemingly random factors--for example, if a certain place on the board is probed, or a certain daughtercard is present. We have not recently changed the design and have not seen this problem before.

The first thing we checked was that the board's power rails come up in the expected sequence and are stable by the time we start releasing reset. No problems there.

Next, we looked at the glue logic that controls the reset. After some probing and trying different things, we ruled out problems in the glue logic. Could this be some new, undiscovered erratum or a bad lot of microcontrollers? The problem only occurs on the first reset after power-up. If we can get out of reset, any following resets have no problems whatsoever.

The next step was to think about what the controller is doing right at startup, before it releases reset. We have it set to look in the boot ROM at address 0 to retrieve a configuration word (CW) that further configures the controller before it's up and running. Normally it will find a valid preamble and read a bunch of words from the ROM. But in the failing cases, we only saw it read two words and then stop. Is it reading bad data?
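To make the symptom concrete, here is a minimal sketch of that startup read, written as a behavioral model in Python rather than anything the controller actually runs. The preamble values, the number of configuration words, and the name `boot_read` are all made up for illustration. It also assumes a two-word preamble, which would fit the two reads we saw in the failing case but is only a guess.

```python
# Hypothetical values: the real preamble and config length are not given here.
PREAMBLE = (0xA5A5, 0x5A5A)   # assumed two-word preamble
CONFIG_LEN = 6                # assumed number of configuration words


def boot_read(rom):
    """Model the controller's startup read from a ROM image (a list of words).

    Returns (config, reads): config is the list of configuration words, or
    None if the preamble did not match; reads is how many ROM reads were made.
    """
    header = (rom[0], rom[1])          # the preamble is fetched from address 0
    reads = 2
    if header != PREAMBLE:
        return None, reads             # bad preamble: stop after two reads
    config = [rom[2 + i] for i in range(CONFIG_LEN)]
    return config, reads + CONFIG_LEN


good_rom = list(PREAMBLE) + [0x1000 + i for i in range(CONFIG_LEN)]
bad_rom = [0x00C2] + good_rom[1:]      # first word corrupted, rest intact
```

With these assumptions, `boot_read(good_rom)` makes eight reads and returns the configuration, while `boot_read(bad_rom)` gives up after two reads, which is the pattern that makes the controller look stuck.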

Time to break out the logic analyzer and look at the bus. Interestingly, the probes seem to load the bus in such a way that we can no longer produce the failure! Now which analyzer pods can we take away to get the failure again? It turns out that the pod with the address lines makes the board pass when present, and sometimes fail when removed.

With the analyzer on the bus, we could see that the first word out of the ROM is correct--part of the preamble--in the passing case, but some other value in the failing case. But the address and control signals all look correct. We checked the reset pulse for the ROM and made some adjustments. After all, if the ROM is somehow coming up in a funky state, a good reset pulse should take care of it, right? But no improvement there.

Finally, we contacted the ROM vendor and found out about a newly discovered erratum. Sometimes it really is someone else's fault! There is a known issue where the first read access after power-up, if it is to address 0, outputs the manufacturer code instead of the data. We happened to build with some parts that were susceptible to this problem.
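The erratum is easy to state as a small behavioral model. This is only a sketch: the class name `FlakyRom`, the manufacturer code value, and the `affected` flag are hypothetical, and the model fails every time on an affected part, whereas the real boards only failed some of the time.

```python
class FlakyRom:
    """Behavioral model of a ROM with the first-read erratum."""

    MANUFACTURER_CODE = 0x00C2   # hypothetical value, for illustration only

    def __init__(self, contents, affected=True):
        self.contents = list(contents)
        self.affected = affected          # False models a part without the bug
        self.first_read_pending = True

    def power_cycle(self):
        """Power-up re-arms the erratum; a plain reset does not."""
        self.first_read_pending = True

    def read(self, addr):
        first = self.first_read_pending
        self.first_read_pending = False
        # Only the first access after power-up, and only at address 0,
        # returns the manufacturer code instead of the stored data.
        if first and self.affected and addr == 0:
            return self.MANUFACTURER_CODE
        return self.contents[addr]
```

For example, with `rom = FlakyRom([0xA5A5, 0x5A5A])`, the first `rom.read(0)` returns the manufacturer code and the second returns `0xA5A5`. That matches what we saw: once the board gets past the first access, later resets read the correct preamble.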

Now we can work around this, and should be good to go. We changed the glue logic to force another reset if the controller is stuck, which is a good robustness improvement anyway. Now we just need to get the revised silicon from the vendor.
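The workaround itself lives in hardware glue logic, but the idea can be sketched in a few lines of Python. Everything here is illustrative: `release_reset_with_retry`, `make_board`, and the limit of three resets are invented names and numbers, not what is in our design, and the fake board fails its first boot every time instead of intermittently.

```python
def release_reset_with_retry(boot_once, max_resets=3):
    """Pulse reset until the controller boots; return how many resets it took.

    boot_once is a callable that performs one reset-and-boot attempt and
    returns True if the controller came up. max_resets is an assumed limit.
    """
    for resets in range(1, max_resets + 1):
        if boot_once():
            return resets
    raise RuntimeError("controller still stuck after %d resets" % max_resets)


def make_board():
    """Fake board whose first boot after power-up fails, then works."""
    state = {"first_access": True}

    def boot_once():
        ok = not state["first_access"]   # only the very first attempt fails
        state["first_access"] = False
        return ok

    return boot_once
```

Calling `release_reset_with_retry(make_board())` returns 2: the first attempt absorbs the bad read, and the second reset boots normally. The retry limit is there so that a genuinely dead board is reported rather than reset forever.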