2002-04-08 05:42:48-05
Hardware problems have been bugging me for a while.

For some time, there has been something not quite right with this webserver.
It hasn't been bad enough to cause any major issues (at least, not since I ran sysctl -w machdep.ddb_on_nmi=0 ; sysctl -w machdep.panic_on_nmi=0), but dmesg would keep revealing bursts of the following during heavy disk activity:
Assuming that the flag bits are ORed together, this indicates a memory parity error, an I/O error, and some undefined error.
These errors showed up from time to time without causing any major problems (apart from make occasionally seg-faulting during the tree cleanup stage of make buildworld, or during a make index under /usr/ports), and I felt that they were not worth a trip to the co-lo to look at the server.
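
To make the bit reading above concrete, here is a minimal sketch of decoding such a status byte. It assumes the conventional PC/AT layout of the NMI status port (0x61), where bit 7 flags a RAM parity error and bit 6 an I/O channel check. The function name and the sample value 0xC8 are made up for illustration; the value actually logged on this server is not reproduced here.

```python
# Assumed flag layout for the PC/AT NMI status port (0x61).
# Only the two bits relevant here are listed; everything else is
# reported as "not in table" rather than guessed at.
NMI_FLAGS = {
    0x80: "RAM parity error",
    0x40: "I/O channel check",
}


def decode_nmi_status(status):
    """Return one description per bit set in the status byte, high bit first."""
    found = []
    for bit in range(7, -1, -1):
        mask = 1 << bit
        if status & mask:
            found.append(NMI_FLAGS.get(mask, "not in table (bit %d)" % bit))
    return found


if __name__ == "__main__":
    # 0xC8 is a hypothetical value, not the one from the log.
    for description in decode_nmi_status(0xC8):
        print(description)
```

With that sample value, the two known flags come out alongside one bit the table does not cover, which is the same shape as the reading given above.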

Now they are increasing, and have been joined by a friend:
kernel trap 19 with interrupts disabled

As far as I can tell, this kernel trap indicates that an NMI came in while the system was already servicing another NMI.

Not good...

The other day, one of my nightly Memtest runs showed an error. This also happened a couple of weeks ago, with the same bit being flipped, at the same memory location.

Even worse...

So, tonight I'm going to visit the co-lo, swap out some memory for testing, replace the drive cables with known good cables (just in case; the faults occurring during heavy disk activity make me want to change them), maybe underclock the processor if things seem too warm, and poke around with the rest of the system.
Hopefully I have better luck fixing this than other people who have had the same problem...

2002-10-18 22:55:32-05

I doubt your system would have panicked even with the sysctl knobs turned on.

(see /sys/i386/isa/intr_machdep.c)

My box has suddenly started complaining in just the same way, and it seems even FreeBSD's mail archive has nothing about it... My main suspect, for now, is the IDE subsystem.

(you may contact me via this email: poige 4t morning d0t ru).

2002-10-28 23:36:21-05

When I went to check the server, memtest86 showed no problems with the memory, the cables to the drives were in good condition, all the fans were working, and everything else appeared to be in good shape.

I ended up installing a script to track the system temperature (graphed at http://house.ofdoom.com/~hungerf3/temperature/; I use /usr/ports/sysutils/xmbmon to read the motherboard sensors) and noticed that the problem did seem slightly temperature related: when the system temperature peaked, a burst of log messages would appear. I decided that it must be a heat-sensitive component on the motherboard, or some sensor unique to that design, and decided to just ignore the messages. I moved my system builds and other CPU-intensive tasks to another system, and just uploaded the results.
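
For anyone wanting to check the same kind of correlation from logged numbers rather than by eye, here is a rough sketch. It is not the script I ran; it assumes you already have temperature samples and error timestamps as plain numbers (seconds), and the function name, threshold, window, and sample data are all made up for illustration.

```python
def errors_near_peaks(temps, error_times, threshold, window):
    """Split error timestamps by proximity to a hot temperature sample.

    temps       -- list of (timestamp, degrees) samples
    error_times -- list of timestamps at which an error was logged
    threshold   -- temperature at or above which a sample counts as hot
    window      -- maximum distance in seconds between an error and a hot sample
    """
    hot_times = [t for t, degrees in temps if degrees >= threshold]
    near, far = [], []
    for e in error_times:
        if any(abs(e - h) <= window for h in hot_times):
            near.append(e)
        else:
            far.append(e)
    return near, far


if __name__ == "__main__":
    # Hypothetical data: one sample every five minutes, four logged errors.
    temps = [(0, 41.0), (300, 44.5), (600, 52.0), (900, 53.5), (1200, 45.0)]
    errors = [610, 640, 905, 1500]
    near, far = errors_near_peaks(temps, errors, threshold=50.0, window=60)
    print("near a peak:", near)
    print("elsewhere:", far)
```

If most of the errors land in the first list, the temperature theory looks plausible; if they are spread evenly, it probably is not the whole story.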

I ignored the errors, and they gradually increased in frequency, with larger bursts showing up more and more often.

A few days ago, there was another development. The system went down with a failed hard drive. Since I swapped out both drives for new, better-quality ones, I have seen only 4 of those errors. Whereas before I would see dozens a day under a light load, these 4 only showed up when I was really stress testing the disks (vinum resyncing while doing a make buildworld).

My current theory is that the errors were from the hard drive that ended up failing - that the SMART system on the drive was trying to report an error to the computer, which couldn't decode the message and was showing it as the strange error message.

Good luck with solving your problem!
