HDD Badblocks


Hi list, I have a notebook with C7 (1511). This notebook has 2 disks (640 GB each), which I’ve configured with MD at RAID level 1. Some days ago I noticed a critical slowdown while opening applications.

First of all, I disabled ACPI on the disks.

I checked both sda and sdb for bad blocks 4 consecutive times and noticed a strange behaviour.

On sdb there are no problems, but on sda:

1) The first run of badblocks reports 28 bad blocks on the disk.
2) The second run reports 32 bad blocks.
3) The third run reports 102 bad blocks.
4) The last run reports 92 bad blocks.

Running smartctl after the last badblocks check, I noticed that Current_Pending_Sector was 32 (not 92, as badblocks found).

To force sector reallocation I filled the disk up to 100%, ran badblocks again, and 0 bad blocks were found. Running smartctl again, Current_Pending_Sector is 0 but Reallocated_Event_Count = 0.

Why does each consecutive run of badblocks report different results?
Why doesn’t smartctl update Reallocated_Event_Count?
The bad blocks found on sda increase/decrease without a clear reason. Can this behaviour be related to the RAID (if a disk has bad blocks, can those bad blocks be replicated onto the second disk)?

What other tests can I perform to verify disk problems?

Thanks in advance.

28 thoughts on - HDD Badblocks

  • Have you run a “long” SMART test on the drive? smartctl -t long device

    I’m not sure what’s going on with your drive. But if it were mine, I’d want to replace it. If there are issues, that long smart check ought to turn up something, and in my experience, that’s enough for a manufacturer to do a warranty replacement.
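
    For reference, something like this should do it (a sketch; adjust the device name):

      smartctl -t long /dev/sda       # start the extended self-test; it runs on the drive itself
      smartctl -l selftest /dev/sda   # check the self-test log once the estimated runtime has passed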

  • I agree with Matt. Go ahead and run a few of the S.M.A.R.T. tests. I can almost guarantee, based on your description of the problem, that they will fail.

    badblocks(8) is a very antiquated tool. Almost every hard drive has a few bad sectors from the factory. Very old hard drives used to have a list of the bad sectors printed on the front of the label. When you first created a filesystem you had to enter all of the bad sectors from the label so that the filesystem wouldn’t store data there. Years later, more bad sectors would form and you could enter them into the filesystem by discovering them using a tool like badblocks(8).

    Today, drives do all of this work automatically. The manufacturer of a hard drive will scan the entire surface and write the bad sectors into a section of the hard drive’s electronics known as the P-list. The controller on the drive will automatically remap these sectors to a small area of unused sectors set aside for this very purpose. Later if more bad sectors form, hard drives when they see a bad sector will enter it into a list known as the G-list and then remap this sector to other sectors in the unused area of the drive I mentioned earlier.

    Basically, under normal conditions the end user should NEVER see bad sectors. If badblocks(8) is reporting bad sectors, it is very likely that enough bad sectors have formed that the pool of reserved spare sectors is depleted. While in theory you could run badblocks(8) and pass the list to the filesystem, I can assure you that the growth of bad sectors at this point will only continue.

    I’d stop using that hard drive, pull any important data, and then proceed to run S.M.A.R.T. tests so if the drive is under warranty you can have it replaced.

    Brandon Vincent

  • On 17/01/2016 18:46, Brandon Vincent wrote:
    I’m running long smart test. I’ll report data when finished

  • On 17/01/2016 19:36, Alessandro Baggi wrote:

    I’ve performed smartctl test on sda. This is the result from smartctl -a
    /dev/sda:

    smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.4.4.el7.x86_64] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, http://www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Device Model:     WDC WD6400BEVT80A0RT0
    Serial Number:    WD-WXF0AB9Y6939
    LU WWN Device Id: 5 0014ee 0ac91c337
    Firmware Version: 01.01A01
    User Capacity:    640,135,028,736 bytes [640 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    5400 rpm
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   ATA8-ACS (minor revision not indicated)
    SATA Version is:  SATA 2.6, 3.0 Gb/s
    Local Time is:    Mon Jan 18 09:42:01 2016 CET
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x00) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                (15960) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        ( 185) minutes.
    Conveyance self-test routine
    recommended polling time:        (   5) minutes.
    SCT capabilities:              (0x7037) SCT Status supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0027   185   170   021    Pre-fail  Always       -       1716
      4 Start_Stop_Count        0x0032   067   067   000    Old_age   Always       -       33362
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2129
     10 Spin_Retry_Count        0x0032   100   100   051    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1572
    191 G-Sense_Error_Rate      0x0032   001   001   000    Old_age   Always       -       529
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       257
    193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       308088
    194 Temperature_Celsius     0x0022   121   089   000    Old_age   Always       -       26
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                    Remaining  LifeTime(hours)
    # 1  Extended offline    Completed without error       00%         2126
    # 2  Short offline       Completed without error       00%         1626
    # 3  Extended captive    Interrupted (host reset)      90%          379
    # 4  Short offline       Completed without error       00%          327

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

    It seems that there are no errors. Is this disk failing?

  • Also useful: a complete dmesg posted somewhere (unless your MUA can be set not to wrap lines)

    Chris Murphy

  • On 18/01/2016 12:09, Chris Murphy wrote:
    SCT Error Recovery Control command not supported

  • That’s strange, I expected the SMART test to show some issues. Personally, I’m still not confident in that drive. Can you check cabling? Another possibility is that there is a cable that has vibrated into a marginal state. Probably a long shot, but if it’s easy to get physical access to the machine, and you can afford the downtime to shut it down, open up the chassis and re-seat the drive and cables.

    Every now and then I have PCIe cards that work fine for years, then suddenly disappear after a reboot. I re-seat them and they go back to being fine for years. So I believe vibration does sometimes play a role in mysterious problems that creep up from time to time.

  • That wouldn’t explain the SMART data reporting pending sectors.

    According to spec, a drive may not reallocate sectors after a read error if it’s later able to read the sector successfully. That’s probably what happened here.

    Drives are consumable items in computing. They have to be replaced eventually. Read errors are often an early sign of failure. The drive may continue to work for a while before it fails. The only question is:
    is the value of whatever amount of time it has left greater than the cost of replacing it?

  • Not new: I can remember seeing DEC engineers cleaning up the contacts on memory boards for a VAX 11/782 with a pencil eraser c.1985. It’s still a pretty standard first fix to reseat a card or connector.


  • I used to do that as well. The contacts would come out nice and shiny when you clean them. Then I found out that what I was actually doing was removing the very thin layer of gold plating on the contacts and revealing the copper underneath. That’s why you should never clean contacts with a pencil eraser, just re-seat the boards and they’ll make contact again.

    Peter

  • It’s dying. Replace it now.

    On a modern hard disk, you should *never* see bad sectors, because the drive is busy hiding all the bad sectors it does find, then telling you everything is fine.

    Once the drive has swept so many problems under the rug that it is forced to admit to normal user space programs (e.g. badblocks) that there are bad sectors, it’s because the spare sector pool is full. At that point, the only safe remediation is to replace the disk.

    SMART is allowed to lie to you. That’s why there’s the RAW_VALUE column, yet there is no explanation in the manual as to what that value means. The reason is, the low-level meanings of these values are documented by the drive manufacturers. “92” is not necessarily a sector count. For all you know, it is reporting that there are currently 92 lemmings in midair off the fjords of Finland.

    The only important results here are:

    a) the numbers are nonzero
    b) the numbers are changing

    That is all. A zero value just means it hasn’t failed *yet*, and a static nonzero value means the drive has temporarily arrested its failures-in-progress.

    There is no such thing as a hard drive with zero actual bad sectors, just one that has space left in its spare sector pool. A “working” drive is one that is swapping sectors from the spare pool rarely enough that it is expected not to empty the pool before the warranty expires.

    Because physics. The highly competitive nature of the HDD business plus the relentless drive of Moore’s Business Law — as it should be called, since it is not a physical law, just an arbitrary fiction that the tech industry has bought into as the ground rules for the game — pushes the manufacturers to design them right up against the ragged edge of functionality.

    HDD manufacturers could solve all of this by making them with 1/4 the capacity and twice the cost and get 10x the reliability. And they do: they’re called SAS drives. :)

    Because SMART lies.

    Quit poking the tiger to see if it will bite you. Replace the bad disk and resilver that mirror before you lose the other disk, too.

  • The drive is disqualified unless your use case can tolerate the possibly very high error recovery time of these drives.

    Do a search for Red Hat documentation on the SCSI Command Timer. By default this is 30 seconds. You’ll have to raise this to 120 or maybe even 180, depending on the maximum time the drive attempts to recover. The SCSI Command Timer is a kernel setting per block device. Basically the kernel is giving up and resetting the link to the drive, because while the drive is in deep recovery it doesn’t respond to anything.
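
    Roughly speaking, per member drive (a sketch; the setting is not persistent across reboots):

      cat /sys/block/sda/device/timeout          # shows the default of 30 seconds
      echo 180 > /sys/block/sda/device/timeout   # raise it to 180 seconds for sda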

    Chris Murphy

  • That’s just masking the problem, his setup will still be misconfigured for RAID.

    It’s a 512e AF drive? If so, the bad sector count is inflated by a factor of 8; in reality fewer than 15 sectors are bad. And none have been reallocated, due to the misconfiguration.

    Chris Murphy

  • agreed

    That’s not actually true. The drive will report a ‘bad sector’ if you try to read data that the drive simply can’t read; you wouldn’t want it to return bad data and say it’s OK. Many (most?) drives won’t actually remap a bad sector until you write new data over that block number, since they don’t want to copy bad data without any way of telling the OS the data is invalid. These pending remaps are listed under SMART attribute 197, Current_Pending_Sector.
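
    If you want to keep an eye on that, something like this (just a sketch; adjust the device name) shows the pending and reallocated counters:

      smartctl -A /dev/sda | grep -E 'Current_Pending_Sector|Reallocated'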


    john r pierce, recycling bits in santa cruz

  • I suspect that the gold layer on edge connectors 30-odd years ago was a lot thicker than on modern cards. We are talking about contacts on 0.1″ spacing, not some modern 1/10 of a gnat’s whisker. (Off topic) I also remember seeing engineers determine which memory chip was at fault and replace the chip using a soldering iron. Try that on a DIMM!


  • Apparently you know more about modern drives than I do, but as far as I know it is a slightly longer story when a bad block is discovered. Here it is.

    Basically, bad blocks are discovered on a read operation when the CRC (cyclic redundancy check) sum does not match. (In fact it is a bit more sophisticated than a plain CRC, as modern high-density drives try to match the analog signal from the read head against what was digitally encoded on write.) When this happens, the firmware decides this is a bad block and adds its new location to the bad-block reallocation table (back when I learned this, the reallocation table lived in non-volatile memory on the drive’s controller board). The firmware then holds all other tasks and tries to recover the information stored in the bad block: it re-reads the block and superimposes the read results until the CRC matches, then writes the recovered data to the reallocated location; or it gives up after some large number of attempts, writes whatever garbage it ended up with to the reallocated location, and reports a fatal read error. This recovery attempt very noticeably slows down I/O on the device, so “freezing” on I/O when accessing files may be an indication of multiple bad blocks developing. Time to replace the drive. The drive (even after an irrecoverable, fatal, read error) is still considered usable; only when the bad-block reallocation table fills up does the drive start reporting that it is “out of specs”.

    On a side note: even if CRC matches, it doesn’t ensure that recovered data is the same as data originally written. This is why filesystems that keep sophisticated checksums of files are getting popular (zfs to name one).

    Just my $0.02.

    Valeri

    ++++++++++++++++++++++++++++++++++++++++
    Valeri Galtsev Sr System Administrator Department of Astronomy and Astrophysics Kavli Institute for Cosmological Physics University of Chicago Phone: 773-702-4247
    ++++++++++++++++++++++++++++++++++++++++

  • This is not a given. Misconfiguration can make persistent bad sectors very common, and this misconfiguration is the default situation in RAID setups on Linux, which is why it’s so common. This, and user error, are the top causes for RAID 5 implosion on Linux (both mdadm and lvm raid). The necessary sequence:

    1. The drive needs to know the sector is bad.
    2. The drive needs to be asked to read that sector.
    3. The drive needs to give up trying to read that sector.
    4. The drive needs to report the sector LBA back to the OS.
    5. The OS needs to write something back to that same LBA.
    6. The drive will write to the sector, and if it fails, will remap the LBA to a different (reserve) physical sector.

    Where this fails on Linux is steps 3 and 4. By default, consumer drives either don’t support SCT ERC, as is the case in this thread, or have it disabled. That means the timeout for deep recovery of bad sectors can be very high, 2 or 3 minutes. Usually it’s less than this, but it’s often more than the kernel’s default SCSI command timer. When a command to the drive doesn’t complete successfully within the default 30 seconds, the kernel resets the link to the drive, which obliterates the entire command queue and the work the drive was doing to recover the bad sector. Therefore step 4 never happens, and neither do any of the steps after it.
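
    A quick way to check which situation a drive is in (a sketch; sdX is a placeholder):

      smartctl -l scterc /dev/sdX    # prints the current ERC read/write timeouts, or says the command is not supported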

    Hence, bad sectors accumulate. The consequence of this often doesn’t get figured out until a user looks at kernel messages, sees a bunch of hard link resets, has a WTF moment, and asks questions. More often they don’t see those reset messages, or don’t ask about them, so the next consequence is that a drive fails. When it’s a drive other than the one with the bad sectors, in effect there are two bad strips per stripe during reads (including rebuilds), and that’s when there’s total array collapse even though there was only one bad drive. People use RAID 6 to mask this problem, but it’s still a misconfiguration that can cause RAID 6 failures too.

    Nope. The drive isn’t being asked to write to those bad sectors. If it can’t successfully read the sector without error, it won’t migrate the data on its own (some drives never do this). So it necessitates a write to the sector to cause the remap to happen.

    The other thing is that the bad sector count on 512e AF drives is inflated. The number of bad sectors is reported in 512-byte sector increments, but there is no such thing on an AF drive: one bad physical sector will be reported as 8 bad sectors. And fixing the problem requires writing exactly those 8 logical sectors at one time, in a single command to the drive. Ergo I’ve had ‘dd if=/dev/zero of=/dev/sda seek=blah count=8’ fail with a read error, due to the command being internally reinterpreted as a read-modify-write. Ridiculous but true. So you have to use bs=4096 and count=1, and of course adjust the seek LBA to be based on 4096 bytes instead of 512.
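
    In practice that means overwriting one bad physical sector looks something like this (a sketch; the LBA here is hypothetical, and the kernel reports it in 512-byte units):

      # dmesg reported a read error at 512-byte LBA 123456784 (hypothetical)
      # 123456784 / 8 = 15432098 is the corresponding 4096-byte physical sector
      dd if=/dev/zero of=/dev/sdX bs=4096 seek=15432098 count=1 oflag=direct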

    So the simplest fix here is:

    echo 160 > /sys/block/sdX/device/timeout

    That’s needed for each member drive. Note this is not a persistent setting. And then this:

    echo repair > /sys/block/mdX/md/sync_action

    That’s once. You’ll see the read errors in dmesg, and md writing back to the drive with the bad sector.

    This problem affects all software raid, including btrfs raid1. The ideal scenario is that you use ‘smartctl -l scterc,70,70 /dev/sdX’ in a startup script, so the drive gives up on marginally bad sectors and reports a read error within 7 seconds at most.
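
    A minimal startup sketch along those lines (the drive list, the 180-second fallback, and the assumption that smartctl exits non-zero when the drive rejects the SCT ERC command are all mine; adjust for your machine):

      for dev in /dev/sd[ab]; do
          # try to cap error recovery at 7 seconds; if the drive refuses,
          # fall back to a much larger kernel command timer instead
          smartctl -q errorsonly -l scterc,70,70 "$dev" ||
              echo 180 > /sys/block/${dev##*/}/device/timeout
      done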

    The linux-raid@ list is chock full of this as a recurring theme.

  • I remember, a long time ago (that was actually in the country “Far, Far Away” ;-) ), we were not allowed to dispose of connectors with gold-plated contacts. These were collected, and the gold was extracted from them and re-used. I believe they dissolved the base brass material in acid and then simply melted down the thin gold shells that were left. Not useful with modern super-thin plating.

    ++++++++++++++++++++++++++++++++++++++++
    Valeri Galtsev Sr System Administrator Department of Astronomy and Astrophysics Kavli Institute for Cosmological Physics University of Chicago Phone: 773-702-4247
    ++++++++++++++++++++++++++++++++++++++++

  • Kids these days! I remember taking the vacuum tubes to the testing centre in the corner drug-store to see which ones need replacing.

    Apologies to the four Yorkshiremen.

  • The standard Unix way of refreshing the disk contents is with badblocks’
    non-destructive read-write test (badblocks -n or as the -cc option to e2fsck, for ext2/3/4 filesystems). The remap will happen on the writeback of the contents. It’s been this way with enterprise SCSI
    drives for as long as I can remember there being enterprise-class SCSI
    drives. ATA drives caught up with the SCSI ones back in the early 90’s with this feature. But it’s always been true, to the best of my recollection, that the remap always happens on a write. The rationale is pretty simple: only on a write error does the drive know that it has the valid data in its buffer, and so that’s the only safe time to put the data elsewhere.
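
    For example (a sketch; run it against an unmounted filesystem only, and remember that -n rewrites every block it tests):

      badblocks -nsv /dev/sdXN     # non-destructive read-write scan of a partition
      e2fsck -fccv /dev/sdXN       # same idea via e2fsck for ext2/3/4 (two -c's = badblocks -n)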

    This is partly why enterprise arrays manage their own per-sector ECC and use 528-byte sector sizes. The drives for these arrays make very poor workstation standalone drives, since the drive is no longer doing all the error recovery itself, but relying on the storage processor to do the work. Now, the drive is still doing some basic ECC on the sector, but the storage processor is getting a much better idea of the health of each sector than when the drive’s firmware is managing remap.
    Sophisticated enterprise arrays, like NetApp’s, EMC’s, and Nimble’s, can do some very accurate predictions and proactive hotsparing when needed.
    That’s part of what you pay for when you buy that sort of array.

    But the other fact of life of modern consumer-level hard drives is that
    *errored sectors are expected* and not exceptions. Why else would a drive have a TLER in the two minute range like many of the WD Green drives do? And with a consumer-level drive I would be shocked if badblocks reported the same number each time it ran through.

  • As long as the DIMM isn’t populated with BGA packages it’s about a ten-minute job with a hot air rework station, which will only cost you around $100 or so if you shop around (and if you have a relatively steady hand and either good eyes or a good magnifier). It’s doable in a DIY way even with BGA, but takes longer and you need a reballing mask for that specific package to make it work right. Any accurately controlled oven is good enough to do the reflow (and baking Xbox boards is essentially doing a reflow……)

    Yeah, I prefer tubes and discretes and through-hole PCB’s myself, but at this point I’ve acquired a hot air station and am getting up to speed on surface mount, and am finding that it’s not really that hard, just different.

    This is not that different from getting up to speed with something really new and different, like systemd. It just requires being willing to take a different approach to the problem. BGA
    desoldering/resoldering requires a whole different way of looking at the soldering operation, that’s all.

  • This isn’t applicable to RAID, which is what this thread is about. For RAID, use scrub; that’s what it’s for.

    The badblocks method fixes nothing if the sector is persistently bad and the drive reports a read error. It fixes nothing if the command timeout is reached before the drive either recovers or reports a read error. And even if it works, you’re relying on ECC recovered data rather than reading a likely good copy from mirror or parity and writing that back to the bad block.

    But all of this still requires the proper configuration.

    The remap will happen on the

    Properly configured, first a read error happens, which includes the LBA of the bad sector. The md driver needs that LBA to know how to find a good copy of the data from the mirror or from parity. *Then* it writes to the bad LBA.

    In the case of misconfiguration, the command timeout expiration and link reset prevent the kernel from ever learning the LBA of the bad sector, and therefore repair isn’t possible.

    The rationale

    Not all enterprise drives have 520/528 byte sectors. Those that do are using T10-PI (formerly DIF) and it requires software support too. It’s pretty rare. It’s 8000% easier to use ZFS on Linux or Btrfs.

    All drives expect bad sectors. A consumer drive reporting a read error will put the host OS into an inconsistent state, so it should be avoided; becoming slow is better than implosion. And neither OS X nor Windows does link resets after merely 30 seconds either.

    Chris Murphy

  • The badblocks read/write verification would need to be done on the RAID
    member devices, not the aggregate md device, for member device level remap. It might need to be done with the md offline, not sure. Scrub?
    There is a scrub command (and package) in CentOS, but it’s meant for secure data erasure, and is not a non-destructive thing. Ah, you’re talking about what md will do if ‘check’ or ‘repair’ is written to the appropriate location in the sysfs for the md in question. (This info is in the md(4) man page).
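
    For example, with md0 (a sketch; see md(4) for the details):

      echo check > /sys/block/md0/md/sync_action   # read-only scrub; read errors get rewritten from a good copy
      cat /proc/mdstat                             # watch the scrub progress
      cat /sys/block/md0/md/mismatch_cnt           # count of inconsistencies found, once it finishes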

    The badblocks method will do a one-off read/write verification on a member device; no, it won’t do it automatically, true enough.

    Very true.

    Yes, for the member drive this is true. Since my storage here is primarily on EMC Clariion, I’m not sure what the equivalent to EMC’s background verify would be for mdraid, since I’ve not needed that functionality from mdraid. (I really don’t like the term ‘software RAID’ since at some level all RAID is software RAID, whether on a storage processor or in the RAID controller’s firmware…..) It does appear that triggering a scrub from sysfs for a particular md might be similar functionality, and would do the remap if inconsistent data is found. This is a bit different from the old Unix way, but these are newer times and so the way of doing things is different.

    Yes, this is very true.