ECC Memory Errors


A few days ago I started receiving messages like this on one of my servers:

Message from syslogd@ at Mon Apr 29 08:02:55 2013 …
server1 kernel: EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-":
(Branch=0 DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x2 (Aliased Uncorrectable Non-Mirrored Demand Data ECC))

I’ve never had ECC memory fail on me before, so I am wondering about the following:

* The server is running CentOS 5.7 and is acting as Xen dom0. Is there any possibility this could be a kernel issue and upgrading would help, or would upgrading at this point just cause more trouble?

* Is there now a possibility that my data can get corrupted? Should I shut down the server as soon as possible, or can I keep running until I replace the memory?

* This server has been running for several years in a datacenter without problems. In your experience, is this kind of problem more likely caused by a failing motherboard or by the memory?

Regards, Peter
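
For reference, the kernel message quoted above already names the memory controller, the row, and the two channels involved. Below is a minimal Python sketch that pulls those fields out of a log line. The regular expression is an assumption modeled only on the single UE line in this post; other EDAC drivers and kernel versions format their messages differently, and the function name `parse_edac_ue_line` is made up for illustration.

```python
import re

# Assumed pattern, based only on the one UE line quoted in this thread.
# Other EDAC drivers print different layouts and will simply not match.
EDAC_UE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): UE row (?P<row>\d+), "
    r"channel-a=\s*(?P<channel_a>\d+) channel-b=\s*(?P<channel_b>\d+)"
)


def parse_edac_ue_line(line):
    """Return the controller, row and channel numbers as ints, or None."""
    match = EDAC_UE_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    sample = ('server1 kernel: EDAC MC0: UE row 0, '
              'channel-a= 0 channel-b= 1 labels "-":')
    print(parse_edac_ue_line(sample))
```

How row and channel numbers map to physical DIMM slots depends on the motherboard, so the result only narrows down which modules to try first.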

20 thoughts on - ECC Memory Errors

  • Not in my experience.

    Maybe – I’m just not sure. You need to replace the memory asap; order it, and schedule a maintenance window with all your users *now*.

    DIMM went bad. No big thing. Your only problem may be to identify which one, he says, about to go into work to do just that.

    mark

  • Monday, April 29, 2013, 1:59:44 PM, you wrote:

    Fully agreed. However, I once had a situation where these messages appeared after a kernel update. I learned that the kernel developers had added some extra error messages for my chipset with that update. Booting an older kernel made the errors go away. OTOH, if the board is already some years old, this is not very likely (and you will have to replace the memory anyway).

    best regards

  • Hi,

    Thanks for your response and suggestions.

    About identifying the faulty DIMM: is the memtest provided on the CentOS 5 installation disk the best tool for this purpose? And do I need to switch ECC off in the BIOS while I test the memory?

    The EDAC error message reports problems with bank 0. Can I trust this? I tried installing edac-utils to get more information, but after installation it only produces a segmentation fault:

    # edac-util --report=simple
    Segmentation fault

    # edac-util -s
    Segmentation fault

    # rpm -qv edac-utils
    edac-utils-0.9-6.el5

    Regards, Peter
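
    When edac-util itself crashes, the per-row error counters it would report can usually be read straight from sysfs. The following is a rough Python sketch under the assumption that the kernel exposes the legacy EDAC layout (`mc*/csrow*/ce_count` and `ue_count` under `/sys/devices/system/edac/mc`); whether that layout is present depends on the kernel and the loaded EDAC driver. The function name `read_edac_counts` is hypothetical.

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Collect corrected (CE) and uncorrected (UE) error counts per csrow.

    Assumes the legacy EDAC sysfs layout: <root>/mc*/csrow*/{ce_count,ue_count}.
    Counters that are missing or unreadable are reported as None.
    """
    counts = {}
    for csrow in sorted(glob.glob(os.path.join(root, "mc*", "csrow*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(csrow, name)) as handle:
                    entry[name] = int(handle.read().strip())
            except (OSError, ValueError):
                entry[name] = None
        # Key such as "mc0/csrow0", relative to the EDAC root directory.
        counts[os.path.relpath(csrow, root)] = entry
    return counts


if __name__ == "__main__":
    for csrow, entry in read_edac_counts().items():
        print(csrow, entry)
```

    A row whose `ue_count` keeps climbing is the first candidate to swap, though the mapping from csrow to physical slot still has to come from the board documentation.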

  • Hi Peter

    One of my old HP DL585s had a similar issue, but it turned out that the DIMM slots were at fault. The server chassis had a few LEDs blinking red for those DIMM slots, indicating that they were faulty. I removed the memory from those slots and re-inserted it into the spare DIMM slots, and everything has been working fine since then.

    Regards, Vipul

  • Vipul Agarwal wrote:
    Hi, there Vipul, old buddy, old pal…. I’ve *got* an HP DL580 G5 that was spitting ECC errors, too. It was fully populated, both the m/b and the four risers. I took my best guess last week and pulled one mirrored pair of memory (OP: make sure that memory isn’t mirrored – then you have to pull at *least* two), replaced them with two from a riser… and then had to take out *two* of the four risers.

    Now, on an HP support list, someone left a message over the weekend that I
    should do a BIOS update… except all I can find is a DOS .exe to do it,
    *and* there’s a comment about needing to install previous BIOS updates…. You don’t happen to know if I do need to install the previous update?

    mark, looking into a freedos USB key solution….

  • I think the “emergency boot cd” contains bootable freedos images…

    was looking for a URL for it, and I find many things under that name, but I’m not sure that any of them is the one I’m thinking of. I’ll post again if I find it.

  • Replying to myself:

    Replacing the first memory pair made the error messages go away.

    Edac-util still segfaults, though. But as the system seems to be otherwise stable, I probably will not investigate this further.

    Regards, Peter

  • From: “m.roth@5-cent.us”

    You can boot from the Firmware Maintenance CD. It will auto-detect all the hardware firmware and update it if needed… You just have to find the most recent CD that still supports your model (see the release notes), since they gradually remove old hardware to make room for new models…

    JD

  • John Doe wrote:

    No such luck: we don’t have such a maintenance CD; if anyone does, it’s the other Institute, and as we’re doing admin work, I’d guess they don’t have a real admin, so who knows where it is.

    But wait, it’s worse than that, Jim…. I did, in fact, boot freedos from the USB key this morning… and when I tried to run it, it announced that it *can’t* be run in DOS mode.

    No, there’s no way anyone’s going to spring for a Windows license, install it, do the update, and redo the system as CentOS.

    Joking aside, do you, or does anyone, have an opinion on the advisability of trying to flash the BIOS under wine?

    mark

  • From: “m.roth@5-cent.us”

    The ISOs are downloadable… Google “firmware maintenance cd” and check the “version history” to get the latest one. Then, check the “release notes” to see if you can find your server model (not always listed). If not, go back a few versions until you find it. CD 8.60, for example, seems to have it and is not too old… Download and burn. Or, you could try the new way (I have not tried it yet):

  • John R Pierce wrote:

    Never heard of it. Just looked at it… winPE – is that the miniXP?

    And it bothers me that I’ve never heard of hirens – I *do* have to be aware of security. I’m still thinking of wine.

    mark

  • Hirens has been around for quite a while, and gets updated periodically. It’s an all-in-one CD or USB boot image full of mostly open source tools; it can boot into memtest86, a Linux kernel/shell/GUI environment, or a ‘bartPE’ mini-XP environment. FreeDOS too, I think.

    hirens themselves only distributes the kit to build it, since it involves some licensed software, but pre-built ones are available from somewhat marginal sources (bit-torrent, etc).

    you could, of course, simply use a plain BartPE boot, too, that you built yourself (Bart provides a toolkit for creating a winPE style environment). you need access to a Windows desktop system somewhere to build one of these, along with the Windows XP CD to get the required files.

  • John R Pierce wrote:

    Interesting. I need to look at them further.

    HOWEVER: I saw something there about unpacking an .exe… and googled that, and found someone talking about doing that… which led me to cabextract, and, sure ‘nough, I now have what was in that exe – flat files, CD, floppy! even a WinDoze floppy label printer!

    Now all I have to do is figure out which will be easiest to use – I’m hoping I can just copy the flat files or the floppy files, or USB files, and reboot into freedos and go.

    mark

  • Hope this works out. :)

    You could use a Windows 98 (or other Windows boot floppy) to boot and run the flashing utility. You can even boot up with your USB stick (has the utilities) plugged in ( it should become C: ).

    I’m sure you could find a Windows floppy image online or one of us would be happy to send you a copy (it would be about the security/trust you can have in freedos images anyways).

    Once you have the image it’s trivial to turn a floppy image into a bootable ISO. But it sounds as if you have floppies and a floppy drive.

    I think the freedos bits I saw in the HP BIOS updating archive for my system were for the “Crisis Recovery Disk”.

  • From: “m.roth@5-cent.us”

    While I still think it is easier to just download and boot on the DVD (which works for many models too), you can try:
