Sunday, November 30, 2008

Troubleshooting new PC bluescreens

When I finished putting together my new custom-built PC, running 64-bit Vista, a few weeks ago, everything seemed to be running great: it was very fast and responsive, all of the hardware components appeared to be working, and I could play 3D games on the system with no errors.

However, after a few days of using the system, it became clear that there was a problem: on four occasions, after I left the machine running overnight, I woke up to find that it had bluescreened while unattended.  Each time, all of my open programs and windows had closed, and an error dialog was open with a message saying that the machine had bluescreened (using that term, "bluescreened"!).  (On two other nights, the machine ran OK without bluescreening.)

Also, on one occasion, the machine bluescreened while I was actively using it, in the middle of a game of Bionic Commando: Rearmed, which was especially aggravating.

I really dislike system instability.  I've always placed a premium on stability in the systems I build; while troubleshooting and tracking down problems can sometimes be interesting, I'd much rather spend my computer time working on a project or playing a game.  So, I set out to track down and fix the cause of the bluescreens.  (Note: This is when having a custom-built machine can be "interesting" -- if I couldn't figure out the cause of the issue, I wouldn't have the fallback option of dialing up a vendor's 1-800 number to get help dealing with the problem!)

Bad RAM?

My first thought was that one of the sticks of RAM in the system might be bad, or maybe that the two sets of RAM sticks that I had put into the machine -- two 2 GB sticks from Corsair and two 2 GB sticks from Crucial (8 GB total) -- were incompatible with one another.  I wasn't terribly happy about this prospect, since it would mean additional troubleshooting to work out which stick(s) of RAM were responsible for the problem, and then shipping the parts back to the store for a replacement or a refund -- something I've never needed to do before.

I decided to use a memory test utility to try to determine whether there really was a RAM issue.  I found a nice blog post by Shivaranjan Bhoopathy detailing Vista's built-in memory diagnostic tool (thanks Shivaranjan!).  I had previously been unaware of this utility; I'd assumed that I'd need to find a third-party utility to do the job.

I ran the utility (which is designed to reboot the machine and then run automatically on the subsequent boot, before Windows loads).  To my relief, the utility reported that all of my RAM was OK!  However, this meant that I needed to continue looking for the cause of the bluescreens.

Heat Issue?

My experience over the years has shown that weird, sometimes-reproducible, sometimes-not, hardware-related issues are often attributable to overheating. 

I found a nice, free utility for Windows, SpeedFan, which gives a readout of CPU temperature (among other features).  SpeedFan reported that my two CPU cores were running at a temperature of between 70 and 75 degrees C -- very hot, dangerously so, for the CPU!

I also rebooted the machine, entered the built-in BIOS utility program as the machine was booting, and checked the temperature there; the BIOS utility program confirmed that the CPU temperature was a very high 70+ degrees C.

So, at this point I thought I'd found the cause of the problem; the only question was how to fix it.  I turned off the machine, opened up the case, and checked the heatsink.  I found that the heatsink was slightly loose -- I was able to wiggle it back and forth slightly with my fingers; if I had installed the heatsink correctly, then I shouldn't have been able to move it at all. 

The problem turned out to be that two of the four "posts" on the corners of the heatsink which bolt the heatsink tightly down against the surface of the CPU were not tightened down all of the way.  As a result, the heatsink wasn't making tight contact against the CPU surface, and consequently wasn't doing a good job of drawing the heat away from the CPU. 

I properly tightened down the heatsink, and confirmed that it was now tightly bolted down against the CPU surface, and couldn't be "wiggled."  I turned the machine back on, and monitored the temperature with SpeedFan.  This time, the CPU temperature never rose above 40-45 degrees C, even after the machine had been on for a while.  Much better!
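As a rough illustration of the before-and-after comparison, here is a minimal Python sketch that flags a set of temperature samples as too hot.  The 65 degree C warning threshold, the function name, and the individual sample values are all assumptions made up for the example (the samples simply fall within the ranges SpeedFan showed me); the real limit for a given CPU should come from the manufacturer's specification, and this is not how SpeedFan itself works.

```python
# Hypothetical warning threshold in degrees C -- an assumption for this sketch,
# not a vendor-specified limit.
WARN_THRESHOLD_C = 65.0


def classify_readings(readings_c, threshold_c=WARN_THRESHOLD_C):
    """Return (peak, too_hot) for a list of temperature samples in degrees C."""
    if not readings_c:
        raise ValueError("need at least one reading")
    peak = max(readings_c)
    return peak, peak >= threshold_c


# Illustrative samples only (not logged data): values within the ranges
# seen before and after tightening the heatsink.
before_fix = [70, 72, 75, 73]
after_fix = [40, 42, 45, 43]

for label, samples in (("before", before_fix), ("after", after_fix)):
    peak, too_hot = classify_readings(samples)
    status = "TOO HOT" if too_hot else "ok"
    print(f"{label}: peak {peak} C -> {status}")
```

Checking the peak rather than the average is a deliberate choice here: a single spike is what matters for an overheating CPU.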

Unfortunately, after I left the machine on overnight once again, I came back to it in the morning to find that it had, once again, bluescreened while it was unattended overnight.  This meant that I needed to continue looking for the cause of the issue. 

BIOS and Network Driver Update

At this point, I was running out of things to check.  I had been doing some large file copies over the network while the machine was unattended overnight (copying photos and music files from my old PC to the new one), so I thought that a problem with the network driver or the machine's BIOS might be responsible for the bluescreens.

I visited the Foxconn downloads site (my motherboard manufacturer's site), and downloaded and installed a new network driver; then (unfortunately violating the troubleshooting principle of "only change one thing at a time between tests"), I also downloaded and installed an updated BIOS, using the Foxconn LiveUpdate utility, also from the Foxconn site.

After the BIOS update, I was afraid momentarily that I had "bricked" my motherboard when, after the machine rebooted following the update, I was presented with a scary-looking error message following the machine's power-on self-tests:

CMOS Checksum Bad

However, after some hurried research via Google search (on another machine), this error message turned out only to represent a notification that the machine's BIOS had been updated.  I was able to just bypass the error and continue to boot into Windows, and the machine was fine.

This notification is a good thing in case a virus had performed a BIOS update (for who-knows-what purposes).  However, (1) the error message was unnecessarily scary and unhelpful, and (2) it would have been nice if the Foxconn update utility had warned me about the message in advance, so I didn't have to get so worried upon seeing it!

The same "CMOS Checksum Bad" error message appeared again upon a subsequent boot, but I was (apparently) able to clear it simply by going into the machine's boot-time BIOS utility, and then doing a save-and-exit from the utility (without changing anything).

Conclusion

In any event, after installing the BIOS and Network driver updates, I've had no further bluescreening problems!  The machine has been rock-solid stable ever since -- just the way I like it.

I can conclude that either or both of the BIOS and network driver updates were responsible for fixing the problem -- although as I noted earlier, it would have been better if I'd performed the updates one at a time, so I could tell which one actually solved the problem.
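The "one change at a time" principle can be sketched in a few lines of Python.  Everything here is hypothetical: `find_fixing_change`, `apply_change`, and `is_stable` are names invented for the illustration, and in real life the stability check would be something slow like "leave the machine on overnight and see if it bluescreens."  The simulated run at the bottom pretends the BIOS update was the fix, which is an assumption for the example only; I never actually established that.

```python
def find_fixing_change(changes, apply_change, is_stable):
    """Apply changes one at a time, testing for stability after each.

    Changes are cumulative, so the result is the change that tipped the
    system from unstable to stable. Returns None if nothing helped.
    """
    for name in changes:
        apply_change(name)
        if is_stable():
            return name
    return None


# Simulated system for the example: it becomes "stable" once the BIOS
# update has been applied. This is an assumption, not a known result.
applied = set()


def apply_change(name):
    applied.add(name)


def is_stable():
    return "BIOS update" in applied


culprit = find_fixing_change(
    ["network driver update", "BIOS update"], apply_change, is_stable
)
print("Stable after:", culprit)
```

Because the changes accumulate, this only tells you which step made the difference, not whether that step would have been enough on its own; confirming that would mean rolling back the earlier changes and testing again.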

I'm also happy in retrospect that the bluescreens occurred, since they led me to discover the heat issue and the improperly-installed heatsink; if I hadn't noticed that, letting the machine run for a long period of time at 70+ degrees C might have significantly shortened the life of the CPU.  I also got to discover a couple of cool utilities that I hadn't previously been aware of, namely, the built-in Vista memory diagnostic tool and the SpeedFan temperature-monitoring utility.

3 comments:

  1. I would put my money on the BIOS. I had some blue-screen issues that were always reporting problems with the virus-scan software that went away with a BIOS update. For me the particularly frustrating part about the deal was that the BIOS version on the machine when I got it was already out of date -- you'd think they could update to latest prior to shipping!!!

    Hopefully you'll not see any more issues now that you've done all that, and fixing the heat-sink is probably a very good thing anyway -- you didn't want to have to replace CPUs in a couple months...

  2. @Chris, thanks for the comment -- hearing that you had the same problem (or at least a similar one -- bluescreening), and got it solved the same way (BIOS update), makes me feel better about my solution.

    I agree -- having to update the BIOS yourself instead of having it done by the manufacturer is a hassle. Although I suppose it's understandable if the case is that the retail box that I bought had been sealed and sitting on a warehouse shelf somewhere for a year or two before I bought it -- which wouldn't surprise me, given that I bought a relatively inexpensive motherboard.

    Maybe you get a newer BIOS if you drop $250 on a motherboard (but I really wouldn't know). :-)

    Finally, I definitely agree, discovering that I had installed the CPU heatsink improperly and that the CPU was running very hot was definitely a very positive side-effect of the whole exercise. (I wouldn't be surprised if that one bluescreen I had while playing a game was due to an overheat, rather than the BIOS issue, since the game was also putting a pretty good load on the CPU.)

  3. Yeah, I hadn't thought about the fact that your motherboard was probably a little more 'mature.' My laptop was supposedly put together when it was ordered, but who knows how long the components were on shelves prior to 'assembly' (or how long the machine sat assembled prior to actually being ordered). Anyway, I used to have blue-screens similar to what you reported (rebooted after sitting for periods unattended) and since the BIOS update, it has not happened -- and it has been several months now. I hope you'll have the same good fortune...
