Once it was said...
   
2002-10-28 23:36:21-05
 
 
  
Pathwalker

When I went to check the server, memtest86 showed no problems with the memory, the cables to the drives were in good condition, all the fans were working, and everything else appeared to be fine.

I ended up installing a script to track the system temperature (graphed at http://house.ofdoom.com/~hungerf3/temperature/; I use /usr/ports/sysutils/xmbmon to read the motherboard sensors) and noticed that the problem did seem slightly temperature related: when the system temperature peaked, a burst of log messages would appear. I decided that it must be a heat-sensitive component on the motherboard, or some sensor unique to that design, and decided to just ignore the messages. I moved my system builds and other CPU-intensive tasks to another system, and just uploaded the results.
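A minimal sketch of that kind of check, in Python. This is not the script used above. The function name, the 50 degree threshold, the 5-minute window, and the sample readings are all made up for illustration. It takes timestamped temperature readings and the timestamps of the log messages, and reports what fraction of the messages landed close to a hot reading.

```python
from datetime import datetime, timedelta


def fraction_near_hot_readings(temps, errors, threshold_c,
                               window=timedelta(minutes=5)):
    """Return the fraction of error timestamps that fall within `window`
    of any temperature reading at or above `threshold_c`.

    temps  -- list of (datetime, degrees_celsius) readings
    errors -- list of datetime stamps taken from the system log
    """
    if not errors:
        return 0.0
    # Times at which the board was at or above the (assumed) threshold.
    hot_times = [when for when, celsius in temps if celsius >= threshold_c]
    near = sum(
        1 for err in errors
        if any(abs(err - hot) <= window for hot in hot_times)
    )
    return near / len(errors)


# Hypothetical sample data: one temperature peak, four log messages.
start = datetime(2002, 10, 1, 12, 0)
sample_temps = [
    (start, 38.0),
    (start + timedelta(minutes=10), 41.0),
    (start + timedelta(minutes=20), 52.0),   # the peak
    (start + timedelta(minutes=30), 40.0),
]
sample_errors = [
    start + timedelta(minutes=19),
    start + timedelta(minutes=21),
    start + timedelta(minutes=22),
    start + timedelta(minutes=45),           # well after the peak
]

if __name__ == "__main__":
    share = fraction_near_hot_readings(sample_temps, sample_errors, 50.0)
    print(f"{share:.0%} of log messages were within 5 minutes of a hot reading")
```

A high fraction only suggests a link with temperature. It does not say which component is at fault, so it is worth comparing against the same figure for a quiet period.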

I kept ignoring the errors, and they gradually increased in frequency, with larger bursts showing up more and more often.

A few days ago, there was another development: the system went down with a failed hard drive. Since I swapped out both drives for new, better-quality drives, I have seen only 4 of those errors. Whereas before I would see dozens a day under a light load, these 4 showed up only when I was really stress testing the disks (vinum resyncing while doing a make buildworld).

My current theory is that the errors came from the hard drive that ended up failing: the SMART system on the drive was trying to report an error to the computer, which couldn't decode the message and logged it as the strange error message.

Good luck with solving your problem!

