Wednesday, November 17, 2010

Interesting reset problem

I happened across this description of a board that failed to boot up properly, just as we were dealing with a similar issue at work. You never know where debugging is going to lead you.

We are ramping up for production of a certain controller board, and have built many prototypes and pre-production boards. Suddenly the factory reported a batch of boards that sometimes get stuck in reset, with different frequencies of occurrence depending on seemingly random factors--for example, if a certain place on the board is probed, or a certain daughtercard is present. We have not recently changed the design and have not seen this problem before.

The first thing we checked was whether the board's power rails come up in the expected sequence and are stable by the time we start releasing reset. No problems there.

Next, we looked at the glue logic that controls the reset. After some probing and trying different things, we ruled out problems in the glue logic. Could this be some new, undiscovered erratum or a bad lot of microcontrollers? The problem only occurs on the first reset after power-up. If we can get out of reset once, any following resets have no problems whatsoever.

The next step was to think about what the controller is doing right at startup, before it releases reset. We have it set to look in the boot ROM at address 0 to retrieve a configuration word (CW) that further configures the controller before it's up and running. Normally it will find a valid preamble and read a bunch of words from the ROM. But in the failing cases, we only saw it read two words and then stop. Is it reading bad data?

Time to break out the logic analyzer and look at the bus. Interestingly, the probes seem to load the bus in such a way that we can no longer produce the failure! Now which analyzer pods can we take away to get the failure again? It turns out that the pod with the address lines makes the board pass when present, and sometimes fail when removed.

With the analyzer on the bus, we could see that the first word out of the ROM is correct--part of the preamble--in the passing case, but some other value in the failing case. But the address and control signals all look correct. We checked the reset pulse for the ROM and did some adjustments. After all, if the ROM is somehow coming up in a funky state, a good reset pulse should take care of it, right? But no improvement there.

Finally, we contacted the ROM vendor and learned of a newly discovered erratum. Sometimes it really is someone else's fault! There is a known issue where the first read access after power-up, if it is to address 0, outputs the manufacturer code instead of the data. We happened to build this batch with parts that are susceptible to the problem.

Now we can work around this and should be good to go. We changed the glue logic to force another reset if the controller gets stuck, which is a good robustness improvement anyway. In the meantime, we just need to get the revised silicon from the vendor.


Tuesday, June 29, 2010

$4 Development Kit

I was tipped off by someone from work that TI has this little development kit available for the low, low price of $4.30. But so many people have jumped on this that it's backordered.

It would be nice to be able to get my hands on this for tinkering at home. You can't beat the price for a complete development kit. I'm assuming it's powered right off the USB port, so no external power supply is needed.

Story is here:
http://ee.cleversoul.com/news/tis-amazing-430-launchpad.html/

Friday, May 21, 2010

Digital Feedback Control

Linear feedback systems are well known. In the continuous domain, the Laplace transform can be used to characterize the system and find an appropriate compensator to give the desired response. But in a digital world, this can break down. If the output of the system is sampled at discrete time intervals, the time lag can cause system instability that might not be predicted with the Laplace transform.

With some simple continuous feedback systems, the gain of the feedback control can be increased to improve the response time. There might not be any theoretical limit to how high the gain can go. But in a discrete time system, the time lag can cause oscillations and instability if the gain is raised beyond some limit.

As a simple example, suppose you have an integral feedback controller. This type of feedback accumulates the error and uses the accumulated value to drive the thing being controlled. It can be a little slower, but it is simpler than a full-fledged PID controller. In the continuous domain it is very stable -- it tends to asymptotically approach the set point for any gain, and it drives the steady-state error to zero, which might be important in some systems. A write-up of that kind of system is here [pdf].

In the continuous domain, the open loop gain is just the integrator, 1/s, times the feedback gain K. In closed-loop form, assuming unity feedback, the transfer function comes out to K / (K + s). As you increase K, you increase the speed of the response.

In the discrete domain, you can simply do this in a tabular fashion, as if you had a digital controller that samples the output, compares against the set point, and accumulates the error. For a small value of K, say 0.2, you get a nice exponential decay towards the set point.
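
Here is a minimal Python sketch of that tabular approach, with the plant modeled as a plain unity gain just to keep the example short (the set point and gain values are arbitrary); plug in larger values of K to see the behaviors described below:

def simulate(K, setpoint=1.0, steps=25):
    output = 0.0       # sampled plant output
    accum = 0.0        # accumulated (integrated) error
    history = []
    for _ in range(steps):
        error = setpoint - output
        accum += error
        output = K * accum      # plant modeled as a simple unity gain
        history.append(output)
    return history

# K = 0.2 gives the smooth exponential approach to the set point.
# Larger values (try 1.5, then 2.1) show the ringing and instability
# discussed below -- for this toy model the limit works out to K = 2.
print(["%.3f" % y for y in simulate(0.2)])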


If K gets larger, you will start to see ringing, but the system is still stable.


Finally, increasing K by too much will make the system unstable.


There are analytical ways of figuring out the response of a discrete-time system. That is a topic for another post.

Wednesday, May 12, 2010

Another Orbit Simulator

This is one I came across while researching orbital mechanics and looking to see what was out there in the way of simulators. It's pretty impressive, and it's free!

http://orbit.medphys.ucl.ac.uk/

Basically, it gives you a first-person view from different types of vehicles on a variety of missions. It can get pretty involved, though. Make sure you have lots of free time.

Sunday, April 25, 2010

Orbit simulator

Reading Neal Stephenson's Anathem got me interested in orbital mechanics. It's a little non-intuitive. If you are trying to dock with a space station, and it's ahead of you in orbit, you need to slow down (fire thrusters opposite the direction you are going, or retrograde.) This puts you in a slightly lower orbit which has a shorter period, so you speed up your angular velocity. Kind of like running on the inside lane around a track. If the station is behind, then speed up to let it catch up.
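
Here is a rough back-of-the-envelope check of that intuition in Python, using the vis-viva equation and Kepler's third law. The constants and the 10 m/s burn are just illustrative numbers for a low Earth orbit, not anything taken from the simulator itself:

import math

MU = 3.986e14        # Earth's gravitational parameter, m^3/s^2 (approximate)
r = 6778e3           # radius of a ~400 km altitude circular orbit, m

def period(a):
    # Kepler's third law: orbital period from the semi-major axis
    return 2 * math.pi * math.sqrt(a**3 / MU)

v_circ = math.sqrt(MU / r)               # circular orbit speed
v_new = v_circ - 10.0                    # small retrograde burn
a_new = 1.0 / (2.0 / r - v_new**2 / MU)  # vis-viva solved for the new semi-major axis

print("circular period:       %.1f s" % period(r))
print("period after the burn: %.1f s" % period(a_new))
# The new period is shorter, so you gain on a station that is ahead of you.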

So I studied the equations and came up with an orbital simulator where you chase a station with your ship and see how close you can get, by firing thrusters to speed up or slow down. By matching the orbital elements, you close in on the station.

A couple of other tips: to get to a circular orbit, get the eccentricity close to zero. Do this by firing thrusters prograde (in the direction you are going) near the hollow square which is apoapsis. This is when you are farthest from the earth (or whatever planet you want the blue circle to be.) Or fire retrograde near the solid square which is periapsis, when you are closest to the earth.

Try to match the period and radius of the station. I still need to add some help on the other elements shown, but they are less crucial. You can also just practice "flying" and see what changing speed at various points does to the orbit. The blue circle planet is virtual, so you'll pass right through instead of crashing or burning up in the atmosphere. Try not to fly off the screen though.

The simulation is here and runs in Java:

http://users.wowway.com/~jrlivermore/orbit/orbitpage.htm

Friday, April 2, 2010

Impressive company blog

If you run a design company and encourage employees to contribute concise, informative articles on the things they are working on, this is the way to do it. It looks very organized, clean, and professional.

http://www.dmcinfo.com/Blog.aspx

Monday, March 8, 2010

Error Correction Coding Receiver - Part 4

Back to the problem of decoding a linear block code with multiple correctable bits.

From page 15 of this [pdf] description mentioned in Part 3, there is a set of minimal polynomials for any particular code, and these can be looked up in a table. In this example, the six syndrome equations of a triple error correcting code need only three unique minimal polynomials:

m1(x) = m2(x) = m4(x) = x^5 + x^2 + 1

m3(x) = m6(x) = x^5 + x^4 + x^3 + x^2 + 1

m5(x) = x^5 + x^4 + x^2 + x + 1

The syndrome vector can be obtained by reducing the received codeword v(x) modulo each of the minimal polynomials.

With the LFSR, it is a simple matter of setting the reduction polynomial to each of the minimal polynomials, shifting in the codeword, and saving the result as S1(x), S2(x), and so on.
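
As a rough software model of that step, here is a sketch in Python. The received word below is just a placeholder, and the integer bit-mask representation is only a convenient convention here, not something from the reference:

# Polynomials over GF(2) are held as integers: bit n is the coefficient of x^n.

def gf2_mod(dividend, divisor):
    # remainder of GF(2) polynomial division -- the same thing the LFSR computes
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

M1 = 0b100101   # x^5 + x^2 + 1              (m1 = m2 = m4)
M3 = 0b111101   # x^5 + x^4 + x^3 + x^2 + 1  (m3 = m6)
M5 = 0b110111   # x^5 + x^4 + x^2 + x + 1    (m5)

v = 0b1010011100010110100101101001010   # placeholder 31-bit received codeword

S1 = gf2_mod(v, M1)   # S2(x) and S4(x) come from this same remainder
S3 = gf2_mod(v, M3)
S5 = gf2_mod(v, M5)
print(bin(S1), bin(S3), bin(S5))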

Next, a system of equations in α is needed. This is achieved by taking S1(α), S2(α^2), S3(α^3) ... S6(α^6). Replacing x with α^i is the same as spreading out the "bits" in the polynomial in x with (i - 1) zeroes in between, and taking the result modulo the reduction polynomial for this field.

In this example, that means S1 just replaces x with α. S2 through S6 can each be fed into an LFSR with the spacing mentioned above, and with the reduction polynomial set appropriately. In this case that is x^5 + x^2 + 1.

Example: If S3(x) = x^3 + x^2 + 1, then feed in x^9 + x^6 + 1. Binary 01101 becomes 1001000001 -- two zeroes are inserted between each pair of original bits to get the equation in α. This is fed into the LFSR to reduce it to fit in the field.
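
Continuing the sketch from above, here is that same S3 example worked in Python (again using integers as bit masks for the polynomials):

def gf2_mod(dividend, divisor):
    # same GF(2) remainder helper as in the earlier sketch
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

def substitute(poly, i):
    # replace x with x^i: the bit at position j moves to position i*j
    out = 0
    for j in range(poly.bit_length()):
        if (poly >> j) & 1:
            out |= 1 << (i * j)
    return out

FIELD_POLY = 0b100101            # x^5 + x^2 + 1

S3 = 0b01101                     # x^3 + x^2 + 1
spread = substitute(S3, 3)
print(bin(spread))                       # 0b1001000001, i.e. x^9 + x^6 + 1
print(bin(gf2_mod(spread, FIELD_POLY)))  # S3(alpha^3), reduced to fit in the field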

Now there is a set of six equations in α. We are ready for the Berlekamp-Massey algorithm.

Monday, February 22, 2010

Python Filter Design

I've never really done DSP as part of my day job. I've been to a few training classes and have some texts that just scratch the surface of filter design, modulation, and so on. Never really touched Matlab either. But I've gotten into Python, so I found out there are some interesting modules that handle DSP functions and the related math.

So how hard would it be to come up with a FIR filter that meets certain performance requirements? Well, you can play around with the Python SciPy and NumPy libraries and check the performance of your filter in a few easy lines.


"""
Design a FIR filter and show the frequency response in
a few easy lines
"""

from scipy import signal
from pylab import *

"""
Window types: boxcar, triang, blackman, hamming,
hanning, bartlett, parzen, bohman, blackmanharris,
nuttall, barthann, kaiser (needs beta),
gaussian (needs std),
general_gaussian (needs power, width),
slepian (needs width)
"""

def dbPlot(w, h):
    # plot the magnitude response in dB
    plot(w, 20 * log(abs(h)) / log(10))

# firwin(number of taps, cutoff relative to Nyquist
# rate, window type)
b = signal.firwin(31, 0.4, window='nuttall')

# freqz(numerator coefficients b, denominator coefficients a)
(w, h) = signal.freqz(b, 1)

dbPlot(w, h)
show()


This gives me a nice graphic:



You can play around with different windows, as shown in the code comments, or add taps as needed to get the right transition and attenuation.
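
For example, here is a quick variation on the script above that overlays two of the windows on one plot so the transition width and stopband attenuation can be compared (any of the window names from the comments could be swapped in):

"""
Compare the response of two different windows on the same plot.
"""

from scipy import signal
from pylab import *

def dbPlot(w, h):
    plot(w, 20 * log(abs(h)) / log(10))

for win in ('nuttall', 'hamming'):
    b = signal.firwin(31, 0.4, window=win)
    (w, h) = signal.freqz(b, 1)
    dbPlot(w, h)

legend(('nuttall', 'hamming'))
show()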

Saturday, February 13, 2010

The Importance of Resets, continued

Checked the Picoblaze support forum. The experts thought that the block RAMs should be settled by the time reset is released. I ran a little experiment with a program that just loads the general purpose registers with some known values, then writes them to the RAM. Then I ran a loop to configure the FPGA and check the contents of the memory. About once out of every few dozen to couple hundred configurations, the first memory location came up wrong.

So it looks like I'll add a reset delay to run at configuration and ensure that things are stable before my program starts up. This issue does not seem to have come up anywhere else, so it's kind of a mystery.

Sunday, February 7, 2010

Importance of resets

I interrupt the discussion on error correction coding to concentrate on a problem I'm seeing in an ongoing project.

The board comes up, software configures the FPGA, and about one out of two hundred times, the board immediately goes into reset. It appears that the FPGA is asserting a critical temperature reset, but the board is at room temperature.

The FPGA reads the temperature from a sensor on the board via a SPI bus. So I reproduced the problem and, with the board stuck in this critical temperature state, captured the SPI signals -- no problem there, the temperature from the sensor looks correct. What next?

A Picoblaze microprocessor in the FPGA runs the software that reads the temperatures, computes the alarms, and does some fan control. It is still performing the sensor reads once per second, so it does not appear to have locked up. Maybe the problem is somewhere between the SPI pins on the FPGA, the internal logic, and the path to the Picoblaze?

Chipscope shows that all of the high temperature alarms are set in this error state: high, very high, and critical. I can place a jumper to bypass the critical reset and allow software to boot up, and I can check for the error state by looking for the alarm bits to be set. The Picoblaze writes out the temperatures to some registers, and these look correct, even in the error state. The values match the SPI signals from one reading to the next.

Is the Picoblaze getting confused? The first thing it does at FPGA configuration is to read out the temperature set points from a block RAM into its internal scratchpad. I'm realizing that, while I provided a register bit to allow software to hit the Picoblaze reset in case a reconfiguration is needed, it is not normally asserted at startup. So is this a problem of the RAMs not being quite settled right at the end of FPGA configuration? In any case, it seems like good practice to have a reset asserted right out of configuration to make sure we are starting in a good, known state.

Resetting the Picoblaze makes these persistent alarms go away and everything is back to normal.

The next experiment: perhaps have the Picoblaze read out some data from the block RAM, then write back what it thinks it got, and see if bad data is ever written back. That would point to a root cause. Also, I've dug through the Virtex-4 data sheets to see if there is a window after the DONE bit goes active during which the RAMs may not be ready. So far I haven't found any mention of this.