Filesystem Corruption?


Got an older server here, running CentOS 6.6 (64-bit). Suddenly, at 0-dark-30 yesterday morning, we had failures to connect.

After several tries to reboot and get it working, I tried yum update, and that failed, complaining of a python krb5 error. With more investigation, I discovered that logins were failing because of a problem with pam: it couldn't open /lib64/security/pam_permit.so. The reason was that it was a broken symlink, pointing to a file in the same directory, when the file actually existed in /lib64. Checking other systems, I found it should, in fact, be a regular file, not a symlink.
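The same check can be repeated across a whole directory on the suspect disk. A minimal sketch in Python, assuming the old root is mounted somewhere like /mnt; the function name `find_broken_symlinks` is made up for illustration and is not part of any tool mentioned here:

```python
import os


def find_broken_symlinks(directory):
    """Return sorted (link_path, target) pairs for symlinks whose target is missing."""
    broken = []
    for entry in os.scandir(directory):
        # os.path.exists() follows the link, so it is False for a dangling symlink.
        if entry.is_symlink() and not os.path.exists(entry.path):
            broken.append((entry.path, os.readlink(entry.path)))
    return sorted(broken)
```

Calling `find_broken_symlinks("/mnt/lib64/security")` would list every dangling link in that directory, which shows whether pam_permit.so was the only casualty. It only looks at one directory level and says nothing about links that resolve to the wrong file.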

At this point, the system was considered suspect. I brought the system down, replaced the root drive, and rebuilt. I was not able to build it as CentOS 7, as something in the older hardware broke the install. CentOS 6 built successfully, and the server was returned to service.

I then loaded the drive in another server, and examined it. fsck reported both / and /boot were clean, but when I redid this with fsck -c, to check for bad blocks, it found many multiply-claimed blocks.

First question: anyone have an idea why it showed as clean, until I checked for bad blocks? Would that just be because I'd gracefully shut down the original server, and it mounted ok on the other server?

Mounting it on /mnt, I found no driver errors reported in the logs, nor anything else happening, including logons, before an automated contact from another server, which failed. And I checked our loghost, and nothing odd shows there, in either messages or secure.

At this point, I *think* it’s filesystem corruption, rather than a compromised system, but I’d really like to hear anyone’s thoughts on this.

mark

3 thoughts on - Filesystem Corruption?

  • Just running fsck with no arguments will not do anything unless the filesystem is marked unclean or the time interval between checks has expired. I suspect that fsck -f, which forces a check even on a filesystem marked clean, would have found problems as well.

    Time will tell if there is a hardware problem with the system, but I would probably run some hardware diagnostics on the server including memory and IO tests if you wanted to be on the safe side. You could also reformat the disk and run some write/readback diagnostics if you wanted to find out if the disk is bad.

    Nataraj

  • Someone has suggested reformatting the disk. Before doing that you may want to make an image of the whole drive as it is now: dd the whole device into a file (somewhere on a huge filesystem). I definitely would do that before even running fsck or badblocks (BTW, badblocks has a non-destructive mode), though it is too late to mention that now. You may need this image for future forensics.
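    For what it's worth, the imaging step can also be sketched in Python if you want a checksum recorded at the same time. This is only an illustration of the idea, not a replacement for dd; the names `image_device` and `sha256_of` are invented for the example, and unlike dd with conv=noerror it simply stops with an error at the first unreadable sector:

```python
import hashlib


def image_device(source, image_path, chunk_size=4 * 1024 * 1024):
    """Copy source to image_path chunk by chunk; return the SHA-256 of the bytes copied."""
    digest = hashlib.sha256()
    with open(source, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def sha256_of(path, chunk_size=4 * 1024 * 1024):
    """Re-read a file and return its SHA-256, to confirm an image or a copy of it."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

    Keeping the digest returned by `image_device` next to the image lets you check later, with `sha256_of`, that the original image and any working copies are still identical.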

    The best would be to have some system integrity suite installed before the bad event; then you would be able to tell what exactly changed (and approximately when). Alas, you don't seem to have that option. You should be able to use a backup as a sort of replacement for that (hopefully you back up the system area as well). I would restore everything from the closest date before the event, and compare all you had with what you see on mounted image(s) of your drive (I would definitely play with a copy of a copy of the image, leaving the original intact). I definitely would mount them read only with no journal.

    Take a look in the logs at what kind of events you find there. Check that the logs were not tampered with (chkrootkit may be your friend). Take a look at who logged in when, for how long (and from where!), and see if there is a correlation with some segfaults or kernel oopses, or if some kernel modules were loaded (should they be loaded all of a sudden?). Anyway, take some forensics guide if you don't do forensics often, and follow it. It may take a couple of weeks depending on how busy you are in general. Good luck with that.
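    The compare step is easy to script. A rough sketch, assuming the backup has been restored to one directory and the image copy is mounted read only at another; `snapshot` and `compare_trees` are made-up names, and the sketch only looks at file contents and symlink targets, not at owners, modes or timestamps:

```python
import hashlib
import os


def describe(path):
    """Return a comparable description of one filesystem entry."""
    if os.path.islink(path):
        return ("link", os.readlink(path))  # compare link targets, do not follow them
    if os.path.isdir(path):
        return ("dir", None)
    if os.path.isfile(path):
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        return ("file", digest.hexdigest())
    return ("other", None)  # devices, sockets, FIFOs: type only


def snapshot(root):
    """Map every path under root (relative to root) to its description."""
    entries = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            entries[os.path.relpath(full, root)] = describe(full)
    return entries


def compare_trees(reference_root, suspect_root):
    """Report paths missing from, added to, or changed in the suspect tree."""
    ref = snapshot(reference_root)
    sus = snapshot(suspect_root)
    return {
        "missing": sorted(p for p in ref if p not in sus),
        "added": sorted(p for p in sus if p not in ref),
        "changed": sorted(p for p in ref if p in sus and ref[p] != sus[p]),
    }
```

    A file that turned into a symlink, as pam_permit.so did, would show up under "changed". Files that legitimately changed between the backup date and the event will show up too, so the result is a list of places to look, not a verdict.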

    Hardware (drive) hypothesis. It is very attractive. I would kick myself so wishful thinking does not take over. But if you indeed saw bad blocks detected, this quite likely is your case. Again, the logs must have records, as the drive will report its hardware events. I also would check the SMART status of the drive. Try to get some information from the drive (hdparm comes to my mind; careful, you don't want to change any of the settings hdparm is mostly used for, just collect info). After everything else has been tried, I would run a hard drive fitness test (vendors have downloadable utilities). BTW, what is the model/manufacturer of the drive?

    [There is one more possibility, which is unlikely to be your case: bad memory, or just a small memory error but in a really bad place that caused big consequences. A reboot would resolve the trouble, so it is unlikely to be your case. But if this hits a specific place in RAM, it can cause corruption of the filesystem as well…]

    Good luck! Let us know what you find out.

    Valeri

    ++++++++++++++++++++++++++++++++++++++++
    Valeri Galtsev Sr System Administrator Department of Astronomy and Astrophysics Kavli Institute for Cosmological Physics University of Chicago Phone: 773-702-4247
    ++++++++++++++++++++++++++++++++++++++++