Once it was said...
   
2002-04-08 05:42:48-05
 
 
  
Hardware problems have been bugging me for a while.

For some time, there has been something not quite right with this webserver.
It hasn't been bad enough to cause any major issues (at least, not since I ran sysctl -w machdep.ddb_on_nmi=0 ; sysctl -w machdep.panic_on_nmi=0), but dmesg would keep revealing bursts of the following during heavy disk activity:
NMI ISA 3c, EISA ff
NMI ISA 2c, EISA ff
NMI ISA 2c, EISA ff
NMI ISA 3c, EISA ff
Assuming that each set bit flags a separate condition and that the flags combine, this indicates a memory parity error, an I/O error, and some undefined error.
These errors would show up from time to time without causing any major issues (except for the occasional segfault of make during the tree cleanup stage of make buildworld, or during a make index under /usr/ports), and I felt that they were not worth a trip to the co-lo to take a look at the server.
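
To see exactly which bits are set in the two ISA status values from the log, a small Python sketch like the one below can help. The helper name set_bits is made up for illustration, and the sketch deliberately assigns no meaning to the individual bits, since that depends on the layout of the status register being read.

```python
def set_bits(value, width=8):
    """Return the positions (0 = least significant) of the bits set in value."""
    return [bit for bit in range(width) if value & (1 << bit)]


# ISA status bytes copied from the dmesg lines above.
for status in (0x3C, 0x2C):
    print(f"ISA {status:02x} = {status:08b} -> bits set: {set_bits(status)}")

# XOR of the two values shows which bits differ between them.
print(f"differing bits: {set_bits(0x3C ^ 0x2C)}")
```

The two values differ in a single bit, so whatever that bit represents, it is the only thing changing between the log lines.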

Now they are becoming more frequent, and they have been joined by a friend:
kernel trap 19 with interrupts disabled

As far as I can tell, this kernel trap indicates that an NMI came in while the system was already servicing another NMI.

Not good...

The other day, one of my nightly Memtest runs showed an error. This also happened a couple of weeks ago, with the same bit being flipped, at the same memory location.

Even worse...

So, tonight I'm going to visit the co-lo, swap out some memory for testing, replace the drive cables with known-good ones (just in case: the faults occurring during heavy disk activity make me want to change them), maybe underclock the processor if things seem too warm, and poke around the rest of the system.
Hopefully I'll have better luck fixing this than other people who have had the same problem...

