DIMM Problem


Hey, folks,

I’ve got an HP ProLiant DL580 G5 throwing ECC errors. This is annoying, since a) it’s all new as of a few months ago, and b) it’s *fully* populated. The two things I need to figure out are a) *which* DIMM it is, and b) whether it’s mirrored and, if so, which *other* DIMM needs to come out until we get replacements from the OEM.

Here’s one of many, all identical, from dmesg:
EDAC MC0: CE row 12, channel 1, label “”: Corrected error (Branch=0, Channel 1), DRAM-Bank=2 RD RAS
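
As a starting point for translating that message, here is a minimal Python sketch that pulls the numeric fields out of a line shaped like the one above. The regular expression is an assumption based only on this one sample; other EDAC drivers word their messages differently. `row` and `channel` are the fields that would normally line up with the csrow and channel entries under /sys/devices/system/edac, while `bank` is reported separately and does not by itself name a riser.

```python
import re

# Pattern written against the single dmesg line quoted in this post.
# It is an assumption that other lines on this box share the same shape.
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE row (?P<row>\d+), channel (?P<channel>\d+)"
    r".*?Branch=(?P<branch>\d+)"
    r".*?DRAM-Bank=(?P<bank>\d+)"
)


def parse_edac_ce(line):
    """Return the numeric fields of a corrected-error line, or None."""
    match = EDAC_CE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = ('EDAC MC0: CE row 12, channel 1, label "": Corrected error '
          '(Branch=0, Channel 1), DRAM-Bank=2 RD RAS')
print(parse_edac_ce(sample))
# {'mc': 0, 'row': 12, 'channel': 1, 'branch': 0, 'bank': 2}
```

Feeding every matching dmesg line through `parse_edac_ce` and counting distinct (row, channel) pairs would show whether all the errors really point at one location.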

13 thoughts on - DIMM Problem

  • John R Pierce said the following on 24/04/2013 19:43:

    A ProLiant G5 is all but “new” :)

    Better buy some compatible RAM because the original HP for old servers is very expensive.

    Ciao, luigi

  • Luigi Rosa wrote:

    The *memory* was new – I replaced all, I think, of the original memory. The server’s from ’09. If they had a warranty, it’s well past that, and HP
    won’t chat or email without $$$.

    mark

  • m.roth@5-cent.us said the following on 24/04/2013 19:51:

    ProLiant DL 580 servers have an integrated log.

    If you boot with SmartStart CD you can run “Integrated Management Log Viewer”
    application and see if the system has logged some event related to ECC memory.

    If you find some errors about ECC memory, you have a faulty memory module (the entry in the integrated log SHOULD say which module is faulty).

    If the memory module is new you should be able to get a replacement.

    Ciao, luigi

  • Is there anything in the IML on the iLO? Also, did you try just re-seating the memory or moving it into other slots to see if you can track it down that way?

    Regards,

  • Ah, he said ‘all new as of a few months ago’. Actually, that’s a
    2009-ish server, with four Tigerton/Dunnington quad-core processors
    (roughly equivalent to the Core2 Q series), and it came with a 3 year onsite warranty. It uses PC2-5300 fully buffered DRAM (DDR2-667), so yeah, that RAM is going to be expensive since it’s an older generation.
    Looks to take 4×4 1/2/4/8GB sticks, in pairs, but if you have all 4
    CPUs, you probably need to populate all 4 banks. Oh, and there were memory expansion mezzanine cards, bringing it up to 32 DIMMs total.

    I figured for laughs I’d try and find a RAM layout for it, but the best I’ve found says it’s printed on the lid to the CPU/memory module. http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId5&prodSeriesId454575&prodTypeId351&prodSeriesId454575&objectID

  • Luigi Rosa wrote:

    Oh, I know I can get a replacement. In the meantime, it’s in *use*, and I
    need to arrange to be able to take it down. Then there’s the issue of what comes out – it’s got, I don’t remember, 32 DIMMs maybe, including 3 or 4
    riser boards. The bank=2 makes me *think* it’s riser 2, but which of the four? And where’s its mirror (I think it’s mirrored memory)?

    Good idea, though, and I just installed OpenIPMI and ipmitool… and the only thing that ipmitool sel list shows is a power supply failure yesterday. I did go into the datacenter and look at it, and it’s got this cute pull-out little display… and it’s not showing any of the DIMMs as failing, which goes with the results of cat /sys/devices/system/edac/mc/mc0/csrow*/*count *all* giving me zero, though /sys/devices/system/edac/mc/mc0/ce_count shows 20260 and rising.

    mark
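
The sysfs check described in the comment above can be sketched in Python so that the controller total and the per-csrow counts land side by side. This is a minimal sketch, assuming the legacy csrow layout shown in those paths. It also reads `ce_noinfo_count`, which the kernel's EDAC sysfs interface provides for corrected errors that could not be tied to a slot. Whether that counter explains the zeros on this particular box is an assumption, not something the thread confirms.

```python
import glob
import os


def read_int(path):
    """Read a sysfs counter file and return it as an int."""
    with open(path) as handle:
        return int(handle.read().strip())


def edac_ce_counts(root="/sys/devices/system/edac/mc"):
    """Map each memory controller to its total, no-info and per-csrow CE counts."""
    report = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        csrows = {}
        for row_dir in sorted(glob.glob(os.path.join(mc_dir, "csrow[0-9]*"))):
            csrows[os.path.basename(row_dir)] = read_int(
                os.path.join(row_dir, "ce_count"))
        noinfo_path = os.path.join(mc_dir, "ce_noinfo_count")
        report[os.path.basename(mc_dir)] = {
            "total": read_int(os.path.join(mc_dir, "ce_count")),
            # None when the kernel does not expose the file.
            "noinfo": read_int(noinfo_path) if os.path.exists(noinfo_path) else None,
            "csrows": csrows,
        }
    return report


if __name__ == "__main__":
    for name, counts in edac_ce_counts().items():
        print(name, counts)
```

If `noinfo` tracks `total` while every csrow stays at zero, the driver is counting the errors without being able to attribute them, and the dmesg fields are the better lead.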

  • John R Pierce wrote:

    I need to consider that – it might help. Anyway, that’s it… and
    *everything* is populated. Another HBS (you know, the technical term, Honkin’ Big Server… and I forget how many U’s)

  • From: “m.roth@5-cent.us”

    Install the hp-health tools, and use hplog to get more info. You might need to install compat-libstdc++, and temporarily put back the default ‘/etc/redhat-release’.
    And, if you do not have them already, there is:

  • It won’t help you on troubleshooting which RAM module is bad, but dmidecode may be helpful in figuring out how many slots/sticks you have and what’s populated and not populated.

    Typically if the lights are not on on that display, the RAM is tossing ECC errors or similar, but not fully failing. I have a bunch of G6 and G7 machines, but no G5 to look at to assist you.

    The G8 machines should have been given a different model number, they’re completely different beasts (and there’s a number of things I’m learning to dislike about them, honestly).

    I had a G7 that kept setting the RAM lights as if it had a RAM problem, so the server support vendor visited more than once. The real problem was a failing CPU. I mention it, because I’ve seen RAM problems that really weren’t and were misdiagnosed by the relatively crude monitoring built into those motherboards, more than once. I’ve also run the HP diagnostics for a full day, and had it find absolutely nothing, and have the lights come back on 5 minutes after firing the on-disk OS back up. Same thing with other tools like memtest86.

    Swap the RAM out completely. If that doesn’t fix it, swap the associated processor out. I’ve never seen any other hardware in those pizza box machines be the cause of the RAM problems you’re seeing.

    If you can’t swap it completely, swap sides and move it to the other side. See if it follows the RAM or the slots. Often it follows the slots, and the problem is the CPU which talks to that “half” of the motherboard, not the RAM.

    At least that’s what I’ve seen… YMMV.

    Nate
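
For the dmidecode suggestion at the top of the comment above, here is a rough Python sketch that turns `dmidecode -t 17` (SMBIOS Memory Device records) into a list of slots and reports which are populated. The `SAMPLE` text is hand-written for illustration and was not captured from a DL580 G5; the locator names in it are hypothetical. The helper names are made up for this sketch as well.

```python
import subprocess

# Hand-written illustration of the record layout, NOT real output from this server.
SAMPLE = """\
Handle 0x1100, DMI type 17, 23 bytes
Memory Device
\tSize: 4096 MB
\tLocator: DIMM 1A
\tBank Locator: Not Specified

Handle 0x1101, DMI type 17, 23 bytes
Memory Device
\tSize: No Module Installed
\tLocator: DIMM 2A
"""


def parse_memory_devices(text):
    """Split dmidecode type-17 output into one dict per Memory Device record."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped:
            current = None  # a blank line ends the record
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def populated(devices):
    """Keep only the slots that report an installed module."""
    return [d for d in devices
            if d.get("Size") not in (None, "No Module Installed")]


def read_dmidecode():
    """Run dmidecode for Memory Device records; this normally needs root."""
    result = subprocess.run(["dmidecode", "-t", "17"],
                            capture_output=True, text=True, check=True)
    return result.stdout


for slot in populated(parse_memory_devices(SAMPLE)):
    print(slot["Locator"], slot["Size"])
```

On the real machine, `parse_memory_devices(read_dmidecode())` would replace the sample. As the comment says, this shows what is installed where, not which module is throwing errors.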

  • That’s not altogether surprising: the newer Intel CPUs (and all the AMD
    Opterons) integrate the RAM controller into the CPU chip, and it is basically impossible to tell which is at fault.

  • Nathan Duehr wrote:
    Heh. It’s *fully* populated, the whole m/b, and all four optional risers.

    That’s what it’s doing, ECC correctable. BUT
    /sys/devices/system/edac/mc/mc0/ce_count showed, as I noted, a ton of errors, but under mc0 was csrow[0-7], and the ce_count in each was *0* –
    not sure how that could be, but it was.

    Can’t do that. I don’t have 256G of FBDIMMs laying around, nor do I have another identical box (well, maybe one, and I’m about to surplus that).

    But it’s worse than that, Jim…. The memory’s *mirrored*, *and* it’s requiring the entire m/b to be populated before the optional risers… and the optional risers are each paired.


    Thanks, Nate, I was just hoping someone could show me how to translate what the kernel’s throwing to be able to identify the explicit DIMM.

    And a) it’s technically not ours, it belongs to another Institute, but they’re doing intramural work, and we’re running it, and b) it’s long out of warranty, so I can’t even talk to HP.

    What I’ve done so far, after scheduling downtime, was to pull DIMM 2c, and its mate 6c, then take two from riser 4, and put them on the m/b. After a couple of reboots, I discovered that I couldn’t put it all back without those two DIMMs on riser 4, nor could I just leave riser 4 out; I had to pull *both* riser 3 and 4.

    It’s been back up all day, I ran stress on it for a bit, and my user tried some stuff, and no errors, so I now know that it’s a DIMM on one of those two risers, or it’s one of the ones I pulled from the m/b. Only 1 of 8, instead of 1 of 32….

    In addition, after much googling, I finally found HP system management, and the SIM, separately. Installed them… and SIM seems as though it’s missing something. I try to log on, via the SM homepage, and it takes better than 5 min to get to the page. When I click on memory in system, that takes a number of minutes… and tells me *nothing* at all, where the SM web page at least used to show me what’s occupied.

    Annoyances, all the way around. I expect to bounce the system tomorrow morning, and put the two I pulled from the m/b onto riser 4, then pull riser 2 and replace it with 4; hopefully, we’ll see errors, and I’ll be down to 1 of four bad.

    mark

  • I never managed to get anything useful out of that HP system management stuff. I think I tried from scratch 3 times, on different systems. I
    dunno what bit I was missing; the instructions are all for separate parts, and obviously I was missing something, but I’d end up with a web framework that had nothing useful in it. At the time I was more interested in the RAID management, but there wasn’t any info on the CPU hardware, either.
