I interrupt the discussion on error correction coding to concentrate on a problem I'm seeing in an ongoing project.
The board comes up, software configures the FPGA, and about one time in two hundred, the board immediately goes into reset. It appears that the FPGA is asserting a critical temperature reset, but the board is at room temperature.
The FPGA reads the temperature from a sensor on the board via a SPI bus. So I reproduced the problem and, with the board sitting in this critical temperature state, captured the SPI signals -- no problem, the temperature from the sensor looks correct. What next?
A Picoblaze microprocessor in the FPGA runs the software that reads the temperatures, evaluates the alarms, and does some fan control. It is still performing the sensor reads once per second, so it does not appear to have locked up. Maybe it's something between the SPI pins on the FPGA and the internal logic and the path to the Picoblaze?
Chipscope shows that all of the high-temperature alarms are asserted in this error state: high, very high, and critical. I can place a jumper to bypass the critical reset, which allows software to boot up, and I can then check for the error state by looking for the alarm bits to be set. The Picoblaze writes out the temperatures to some registers, and these look correct, even in the error state. The values match the SPI signals from one reading to the next.
Is the Picoblaze getting confused? The first thing it does at FPGA configuration is to read out the temperature set points from a block RAM into its internal scratchpad. I'm realizing that, while I provided a register bit to allow software to hit the Picoblaze reset in case a reconfiguration is needed, it is not normally asserted at startup. So is this a problem of the RAMs not being quite settled right at the end of FPGA configuration? In any case, it seems like good practice to have a reset asserted right out of configuration to make sure we are starting in a good, known state.
Resetting the Picoblaze makes these persistent alarms go away and everything is back to normal.
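To make the suspicion concrete, here is a small Python model of the failure I have in mind. It is only a sketch under assumptions: the set point values (70, 85, 100), the three-word RAM layout, and the idea that a bad startup read returns zeros are all made up for illustration, and the names `load_set_points` and `evaluate_alarms` are hypothetical. The point is just that a correct temperature compared against bad set points gives exactly the symptom seen: every high alarm set at room temperature, cleared by a reset that reloads the set points.

```python
from dataclasses import dataclass


@dataclass
class SetPoints:
    """Alarm thresholds in degrees C (values used below are illustrative)."""
    high: int
    very_high: int
    critical: int


def load_set_points(block_ram):
    """Copy the first three RAM words into the 'scratchpad'.

    Assumed layout: word 0 = high, word 1 = very high, word 2 = critical.
    """
    return SetPoints(block_ram[0], block_ram[1], block_ram[2])


def evaluate_alarms(temp_c, set_points):
    """Return which alarms are asserted for a given temperature."""
    return {
        "high": temp_c >= set_points.high,
        "very_high": temp_c >= set_points.very_high,
        "critical": temp_c >= set_points.critical,
    }


room_temp = 25
good_ram = [70, 85, 100]   # assumed contents of the set point RAM
bad_read = [0, 0, 0]       # assumed result of a bad read at startup

# Bad set points captured once at configuration: all alarms assert.
print(evaluate_alarms(room_temp, load_set_points(bad_read)))

# After a Picoblaze reset the set points are reloaded and the alarms clear.
print(evaluate_alarms(room_temp, load_set_points(good_ram)))
```

This does not prove the block RAM read is the culprit. It only shows that the hypothesis is consistent with a correct temperature reading and persistent alarms.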
The next experiment: perhaps have the Picoblaze read out some data from the block RAM, then write back what it thinks it got, and see if bad data is ever written back. That would suggest a root cause. Also, I've dug through the Virtex-4 data sheets to see if there is an interval after the DONE bit goes active during which the RAMs may not be ready. So far I haven't found a mention of this.
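On the host side, checking the write-back could be as simple as the following Python sketch. It assumes software can get at both the expected RAM contents and the values the Picoblaze echoes back as plain lists of words; how those values are actually fetched from the board is not shown, and the function name `find_readback_mismatches` is hypothetical.

```python
def find_readback_mismatches(expected, echoed):
    """Return (address, expected_word, echoed_word) for every word that differs."""
    if len(expected) != len(echoed):
        raise ValueError("expected and echoed data must be the same length")
    return [
        (addr, want, got)
        for addr, (want, got) in enumerate(zip(expected, echoed))
        if want != got
    ]


# Illustrative data only: word 1 came back wrong.
expected_words = [70, 85, 100]
echoed_words = [70, 0, 100]

for addr, want, got in find_readback_mismatches(expected_words, echoed_words):
    print(f"address {addr}: expected {want}, Picoblaze read {got}")
```

Since the failure shows up only about once in two hundred configurations, the comparison would need to run after every configuration in a loop, logging any mismatch along with the alarm bits.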