Disk Near Failure


Hi list, on my workstation I have an md RAID mirror for / on md1. This RAID has 2
SSDs as members (each a Corsair Force GT 120GB MLC). These disks are ~5
years old. Today I checked my SSDs' SMART status and I get:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   050    Pre-fail Always       -       0/4754882
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age  Always       -       17337h+11m+24.440s
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age  Always       -       1965
171 Program_Fail_Count      0x0032   000   000   000    Old_age  Always       -       0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age  Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age  Offline      -       780
177 Wear_Range_Delta        0x0000   000   000   000    Old_age  Offline      -       3
181 Program_Fail_Count      0x0032   000   000   000    Old_age  Always       -       0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age  Always       -       0
194 Temperature_Celsius     0x0022   033   042   000    Old_age  Always       -       33 (Min/Max 15/42)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age  Offline      -       0/4754882
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age  Offline      -       0/4754882
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age  Offline      -       0/4754882
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail Always       -       100
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age  Offline      -       6585
234 SandForce_Internal      0x0032   000   000   000    Old_age  Always       -       6885
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age  Always       -       6885
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age  Always       -       6244

The second ssd has very similar values.

SSD_Life_Left has been 0 since about a year ago on each SSD. Today these disks are working without problems:

# hdparm -tT /dev/md1

/dev/md1:
Timing cached reads: 26322 MB in 2.00 seconds = 13181.10 MB/sec
Timing buffered disk reads: 1048 MB in 3.00 seconds = 349.00 MB/sec

# hdparm -tT /dev/sda

/dev/sda:
Timing cached reads: 26604 MB in 2.00 seconds = 13322.82 MB/sec
Timing buffered disk reads: 1140 MB in 3.00 seconds = 379.87 MB/sec

# hdparm -tT /dev/sdb

/dev/sdb:
Timing cached reads: 26258 MB in 2.00 seconds = 13148.38 MB/sec
Timing buffered disk reads: 1140 MB in 3.00 seconds = 379.70 MB/sec

# dd if=/dev/zero of=file count=2000000
2000000+0 records in
2000000+0 records out
1024000000 bytes (1.0 GB) copied, 2.36335 s, 433 MB/s

Are my SSDs failing?

Thanks in advance.

20 thoughts on - Disk Near Failure

  • SSDs wear out based on writes per block. They distribute those writes, but once each block has been written X number of times, it is no longer reliable.

    They appear to still be working perfectly, but they are beyond their design life. Sooner or later, if you continue the amount of writes you've been doing, you'll get back errors or bad data.

    I would plan on replacing those drives sooner rather than later. 5
    years was a good run.

  • Hello Alessandro,

    smartctl -A only shows a total error count for my disks, but I suppose this means 0 errors on 4754882 reads…

    Note that "Pre-fail" does not indicate that your disk is about to fail; it is an indication of the type of issue that causes this particular class of errors.

    No retired blocks, that seems alright…

    The easiest way to test for disk errors is by issuing

    smartctl -l xerror /dev/sda

    If the output contains “No Errors Logged” your disks are fine.

    Quite easy to put this in a (daily) cron job that greps the output of smartctl for that string and if it does not find a match sends a mail warning you about those disk errors.

    #!/bin/bash

    SMARTCTL=/usr/sbin/smartctl
    GREP=/bin/grep

    DEVICES='sda sdb'
    HOST=$(hostname)
    TO='a@example.com'
    CC='b@example.com'

    for d in $DEVICES ; do
        # if the "No Errors Logged" line is absent, errors were found
        if [ "$($SMARTCTL -l xerror /dev/$d | $GREP 'No Errors Logged')" == "" ]; then
            $SMARTCTL -x /dev/$d | mail -c $CC -s "$HOST /dev/$d SMART errors" $TO
        fi
    done

    Regards, Leonard.

  • John R Pierce wrote:

    1. Especially if they’re consumer grade.
    2. And that’s a fairly early large (for SSD) drive.
    3. We’ve got a RAID appliance that takes actual SCSI that’s still running, though we’re now in the process of replacing these 10 yr old RAIDs….
    4. SATA is a *lot* cheaper for *much* larger capacity drives…

    mark

  • On 21/10/2016 17:20, m.roth@5-cent.us wrote:
    Hey there, I've run smartctl -l xerror (and -l error) /dev/sda but I get:

    smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.36.2.el7.x86_64]
    (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, http://www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    SMART Error Log not supported

    I’ve noticed this also with smartctl -a /dev/sda

  • Hi,

    I reckon there's a line missing between those two lines. The line right after the first should read something like:

    SMART overall-health self-assessment test result: PASSED

    or “FAILED” for that matter. If not try running

    smartctl -t short /dev/sda

    , wait for the indicated time to expire, then check the output of smartctl -a (or -x) again.

    Regards, Leonard.

  • On 24/10/2016 14:05, Leonard den Ottolander wrote:
    Hi Leonard, after a SMART short test, the output of smartctl -a /dev/… is

    === START OF INFORMATION SECTION ===
    Model Family:     SandForce Driven SSDs
    Device Model:     Corsair Force GT
    Serial Number:    12297948000015020A81
    LU WWN Device Id: 0 000000 000000000
    Firmware Version: 5.02
    User Capacity:    120,034,123,776 bytes [120 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    Solid State Device
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Thu Oct 27 11:22:22 2016 CEST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x02) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                 (   0) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        (  48) minutes.
    Conveyance self-test routine
    recommended polling time:        (   2) minutes.
    SCT capabilities:              (0x0021) SCT Status supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   120   120   050    Pre-fail Always       -       0/0
      5 Retired_Block_Count     0x0033   100   100   003    Pre-fail Always       -       0
      9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age  Always       -       17394h+07m+56.840s
     12 Power_Cycle_Count       0x0032   099   099   000    Old_age  Always       -       1974
    171 Program_Fail_Count      0x0032   000   000   000    Old_age  Always       -       0
    172 Erase_Fail_Count        0x0032   000   000   000    Old_age  Always       -       0
    174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age  Offline      -       780
    177 Wear_Range_Delta        0x0000   000   000   000    Old_age  Offline      -       3
    181 Program_Fail_Count      0x0032   000   000   000    Old_age  Always       -       0
    182 Erase_Fail_Count        0x0032   000   000   000    Old_age  Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age  Always       -       0
    194 Temperature_Celsius     0x0022   029   042   000    Old_age  Always       -       29 (Min/Max 15/42)
    195 ECC_Uncorr_Error_Count  0x001c   100   100   000    Old_age  Offline      -       0/0
    196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail Always       -       0
    201 Unc_Soft_Read_Err_Rate  0x001c   100   100   000    Old_age  Offline      -       0/0
    204 Soft_ECC_Correct_Rate   0x001c   100   100   000    Old_age  Offline      -       0/0
    230 Life_Curve_Status       0x0013   100   100   000    Pre-fail Always       -       100
    231 SSD_Life_Left           0x0013   100   100   010    Pre-fail Always       -       0
    233 SandForce_Internal      0x0000   000   000   000    Old_age  Offline      -       6599
    234 SandForce_Internal      0x0032   000   000   000    Old_age  Always       -       6894
    241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age  Always       -       6894
    242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age  Always       -       6326

    SMART Error Log not supported

    SMART Self-test Log not supported

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

  • Hi,

    That’s the line you are looking for. Since your disk apparently does not store an error log – not sure if that’s something with SSDs in general or just with this particular disk – you will always have to invoke

    smartctl -t short /dev/sda

    and then after the test has completed check the output of

    smartctl -a /dev/sda

    for that particular line. Shouldn't be too hard to put in a cron job; just make sure the job waits long enough (more than 1 minute, make it 2
    to be sure) before reading the output of smartctl -a after invoking smartctl -t short.
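
    That could look roughly like this. An untested sketch; the device name and mail recipient are placeholders:

    ```shell
    #!/bin/bash
    # Start a short self-test, wait past the 1-minute recommended polling
    # time, then check the overall result line of smartctl -a.
    SMARTCTL=/usr/sbin/smartctl
    DEV=/dev/sda

    $SMARTCTL -t short $DEV
    sleep 120   # polling time is 1 minute; wait 2 to be sure

    if ! $SMARTCTL -a $DEV | grep -q 'test result: PASSED'; then
        echo "$DEV did not report PASSED" | mail -s "SMART warning on $(hostname)" root
    fi
    ```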

    Regards, Leonard.

  • On 27/10/2016 13:58, Leonard den Ottolander wrote:
    Thank you for the suggestion.

    Alessandro.

  • You can also use the smartd service: edit the smartd.conf file and have it send you emails when a disk starts to fail.
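
    A minimal /etc/smartd.conf along those lines might look like this; the schedule and recipient are just examples (see man smartd.conf for the directives):

    ```
    # Monitor both RAID members fully (-a), enable automatic offline
    # testing (-o) and attribute autosave (-S), run a short self-test
    # every day at 02:00, and mail warnings to root.
    /dev/sda -a -o on -S on -s (S/../.././02) -m root
    /dev/sdb -a -o on -S on -s (S/../.././02) -m root
    ```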

  • Hmm, let's do some math:
    17394 hours of "on"-time equals 724.7 days (of continuous "on"-time).
    6894 GiB written to a 120 GiB drive gives 57.4 drive writes
    (with optimal wear leveling, every cell would have been written 57-58 times).
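
    The arithmetic is easy to reproduce from the smartctl figures above (bc with scale=1 truncates to one decimal):

    ```shell
    # Power-on hours converted to days of continuous "on"-time:
    echo "scale=1; 17394 / 24" | bc     # -> 724.7
    # Lifetime GiB written divided by the 120 GiB capacity:
    echo "scale=1; 6894 / 120" | bc     # -> 57.4
    ```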

    The SandForce controller used (likely an SF-2281) is not the best at wear leveling, so the per-cell "use" count will most likely be more than double that.

    For my personal use I would replace that drive ASAP:
    - There is no warranty on it anymore (time since purchase).
    - You can't buy it new anymore (discontinued).
    - There are more reliable drives available.

    I'd go for a Samsung Evo 850; that will give you five years of warranty.

    But it's your drive, you make the decisions.

    – Yamaban.

  • On 27/10/2016 19:38, Yamaban wrote:

    Thank you for your suggestion.

    What do you think about Corsair Neutron XTi 240 MLC?

  • Amazing. He suggested a definitely reliable drive (Samsung). Reliable in my book too. You ask his opinion about yet another Corsair. One Corsair already failed on you, so you should have better knowledge of Corsair's SSD reliability, right?

    Sorry to sound sour, it just amuses me how people keep buying things made by the same company whose products already failed on them. This is what creates the problem: it keeps companies that manufacture bad hardware in existence.

    Valeri

    ++++++++++++++++++++++++++++++++++++++++
    Valeri Galtsev
    Sr System Administrator
    Department of Astronomy and Astrophysics
    Kavli Institute for Cosmological Physics
    University of Chicago
    Phone: 773-702-4247
    ++++++++++++++++++++++++++++++++++++++++

  • On 28/10/2016 16:28, Valeri Galtsev wrote:

    Sorry, but my 2 Corsair SSDs do not report errors and work fine, with good performance and without reallocations. These disks have not failed. Yes, they are failing, but they are old drives and this is a desktop with RAID. Considering that these drives are 5 years old, for me this is not a bad SSD brand; there are better brands, but Corsair is not too bad.

    Now, Yamaban suggested Samsung because it is the best choice. This does not exclude that there are other products (less reliable and less performant, at lower cost) that for my case are good enough. The Corsair Neutron also has 5 years of warranty.

    > Sorry to sound sour, it just amuses me how people keep buying things made
    > by the same company whose products already failed on them. This is what
    > creates the problem: keeps companies manufacturing bad hardware exist.
    >

    If you are an AMD user and your old AMD CPU died, do you think AMD must burn because of a CPU failure? Great. I'm with you in the case where you buy a disk and it fails after 3-6 months (and this can happen with very good brands too), but that is not the case here. By that logic Backblaze should burn every brand, because many disks fail….

    Now, bad hardware manufacturing companies are another problem. These companies target the low-cost consumer, given that not everyone can afford the best hardware. An example? Corsair LE 480 GB
    ($100) vs Samsung SSD 850 Pro 512GB ($260). The 850 Pro is better, but more expensive, and the Corsair LE has 3 years of warranty. Maybe a user would rather spend his money on a video card or a better CPU. These "bad" companies let some users get hardware for less money, without great expectations, for the cheapest use cases and within their ability to pay.

    And if these cheap companies must not exist, must those users not use a new technology (at lower cost)? The IT gap.

    Sorry, my (m.)2 cents.

  • Yes, indeed, I'm with you on that. The market is driven by the low-budget
    (ignorant; not to offend, just to qualify their insight into hardware)
    consumer. Which indeed leads to "fake raid" chips (aka "software" raid) and many other bad things. I sometimes have to deal with what students have ordered themselves, hence my excessive attitude; they order before they hear from me that "pricegrabber is an enemy in choosing reliable hardware". Then it all leads to downtime, and someone has to spend time repairing the darn thing. Whereas, if one pays a mere 15% more and gets good hardware, future losses (including human time, which is very expensive) can be avoided. Alas, the SSD price difference at hand is larger than 15%, hence probably nobody will dare to help with advice, if there is good advice to give. I for one did go with a Samsung SSD…

    Valeri


  • For me the answer to this is: use what you need. For example, consider RAID. On my desktop I have an mdadm RAID mirror. It is only a desktop used for some tasks at home (testing, coding…). Why buy a proper controller like an Areca, or (as suggested in a discussion, maybe on reddit) an HBA, to make a simple mirror? I don't need an HBA or a high-end controller on my i7-2600K. So the consideration should be: "if you need high disk I/O
    performance and a lot of space, buy the right hardware."

    For example, there are a great number of small offices that need a little NAS. There are cheap products that can perform this role, but they are not good hardware. I'm not a fan of this type of solution, but many technicians install them because the client says "oh please, drop the price". If, for a small office, a technician has to charge $1000-1500 for a server to act as a NAS, he will not get the work. I have seen this type of product on a LAN with 120 hosts, with deadly performance.

    On 28/Oct/2016 19:15, "Valeri Galtsev" wrote:

  • Hi Yamaban, great explanation. I think you know how to buy an SSD. There is no doubt about Samsung SSD quality vs others. My question about the Neutron was to get your opinion on that product.

    My doubt was about the differences between SLC, MLC and TLC. MLC endurance is better than TLC, and I thought that the MLC of the Neutron would give me more endurance than TLC. From a technical point of view, why is the Samsung TLC better than the Corsair MLC? And what about V-NAND? Have you used it?

    Thanks in advance

    On 28/Oct/2016 20:33, "Yamaban" wrote:
    [snip]

  • [snip]

    Hi Alessandro,

    For a clear picture: if I talk about the "Corsair SSD" I mean the
    "Corsair Neutron XTi", because the "Corsair Force LE" is pretty much a no-go for anyone who has to rely on the stored data for more than 3 years at a work load of 9 hours per day / 5 days a week / 50 weeks a year
    (ca. 2250 hours per year) at ca. 8 TBW written per year.

    I'm not taking these numbers out of thin air; that is what a normal office PC is based on. Those 8 TBW per year come from observations of Microsoft Windows 10 Professional and the latest Microsoft Office Professional, and consist on average of roughly half system-caused and half user-caused writes per year. Those Microsoft updates and shadow copies are much heavier than most people think.

    Thankfully most Linux distros cause a much lighter system part of the drive's write load than Windows, but COW-based file systems like btrfs are on the uptake, and that will raise the write load.

    Now, on flash technology. Hmm. I started on that with UV-erasable EPROM in 1987 (100 erase cycles), went on with EEPROM (over 10,000 erase cycles!, but only 10 years of data retention), and near 1990 flash EEPROM (block-wise erasable) became available at prices a student could pay from his/her spending money.

    Writes on early flash were painfully slow, about 1% of the read speed at the beginning. The more widespread usage
    (in digital still cameras and mobile phones) brought a (slow)
    change toward more write speed, but at what cost? Data retention time!

    [… long rant removed, it's late in the (not so pleasant) day …]

    On the difference between NAND and V-NAND: "normal" NAND uses a "floating gate",
    while V-NAND, aka "vertical NAND", uses a "charge trap" (capacitor) to store the bit information.

    The Wikipedia articles on flash give some more in-depth info:
    Flash:   https://en.wikipedia.org/wiki/Flash_memory
    MLC/TLC: https://en.wikipedia.org/wiki/Multi-level_cell

    Conclusion: a well-produced (first class / datacenter class) TLC NAND is very similar to a middle-class MLC V-NAND, both in terms of access speed and write endurance. But the MLC will be at least 10% bigger on the die. At the moment it is a cost balance between a lower yield on high-quality, smaller TLC and a higher yield on middle-class, bigger MLC.

    So, for the end user, "MLC V-NAND" vs "TLC NAND" is much less interesting than the question of how well the manufacturer understands the flash used and how well the controller was adapted to it.

    Corsair as a SSD manufacturer buys both, the flash, and the controller, from other manufacturers, while Samsung does it completely in-house.

    Thus it is not surprising that Corsair still uses MLC technology while Samsung has already made the step to TLC.

    I see that as an unspoken statement from Corsair that "we do not have the knowledge available (at the moment) to make a TLC drive of the same quality as an MLC one." That is not negative in any way. A manufacturer that knows its limits is much better than one that jumps on a new hype with too little knowledge.

    Samsung is very careful about its promises on write endurance. TLC is still a young technology and that shows in lower TBW, so the warranty for the "Evo" says:
    "5 years or the TBW per spec, whichever is reached first".

    That’s honesty in my eyes.

    If the question were the "Samsung 850 EVO" with MLC flash from last year versus the new "Corsair Neutron XTi", there would be little to no difference in TBW, but the price of the Samsung was ca. 10% higher.

    IIRC, the TBW spec of your old 120 GiB Corsair was below 10 TBW, and you are nearly at the 7 TBW mark after 5 years; so even the 75 TBW of the 250 GiB
    Samsung should hold out for the next 5 years.

    My baseline is: whether the "Samsung 850 EVO" with TLC or the
    "Corsair Neutron XTi" with MLC is more a matter of gut feeling than anything else. As you will not buy the SSDs in packs of 20 or more, you will never get into any discount scheme, so that offer from Samsung will also not matter in any way, and at the point where you buy, the price per GiB will be nearly equal. Both offer 5 years of warranty.

    Have a nice weekend
    – Yamaban

  • Hello Valeri,

    It did not. He asked whether it is failing, but there is no indication it is near failure, and it definitely hasn't failed yet.

    Can you provide us with links indicating the unreliability of (any or all) Corsair SSDs? I’ve had IBM Deskstars failing on me prematurely. Does that disqualify IBM as a producer of hard drives? (Ok, they are no longer in that business, but that’s not the point I’m trying to make.)
    And not so long ago I had problems with my internet provider buying a batch of lousy Seagates that kept failing. Does that disqualify all Seagate disks?

    Regards, Leonard.

  • Hello Yamaban,

    I fail to see how that is relevant… If you lose your data because of a failing disk you lose your data. Whether or not you get a replacement drive does not change that fact. The length of a warranty might be an indication of the expected life of a product, but it says nothing about the state of one individual drive.

    Again, relevance?

    Still no argument to replace an existing working one… And as I asked Valeri, can you please provide us with links indicating the poor quality of Corsair SSDs (in general)?

    I do not know how SSDs fail, but when regular HDs start to fail you usually have some time to get a replacement before they fail altogether. I would expect the number of reallocated sectors to increase, but still leave a little time to replace the disk once that happens. And supposing the disk actually does store the number of "retired" blocks, this disk seems fine.
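
    For what it's worth, watching that attribute from a script is simple enough. A sketch; the awk filter is shown here against a sample line copied from the output earlier in this thread, while in a cron job the input would come from smartctl -A:

    ```shell
    # Extract the raw value of attribute 5 (Retired_Block_Count) and
    # warn once it is no longer 0.
    sample='  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail Always       -       0'
    raw=$(echo "$sample" | awk '$1 == "5" { print $NF }')
    if [ "$raw" != "0" ]; then
        echo "retired blocks: $raw"
    fi
    # live: smartctl -A /dev/sda | awk '$1 == "5" { print $NF }'
    ```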

    Regards, Leonard.

  • Hi Yamaban,

    Well never mind that request, very interesting to read your lectures later in this thread :) .

    Regards, Leonard.