Errors On An SSD Drive


I am building a new system using a Kingston 240GB SSD drive I pulled from my notebook (when I had to upgrade to a 500GB SSD drive). The CentOS
install went fine and the system ran for a couple of days, then I got errors on the console. Here is an example:

[168176.995064] sd 0:0:0:0: [sda] tag#14 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[168177.004050] sd 0:0:0:0: [sda] tag#14 CDB: Read(10) 28 00 01 04 68 b0
00 00 08 00
[168177.011615] blk_update_request: I/O error, dev sda, sector 17066160
[168487.534510] sd 0:0:0:0: [sda] tag#17 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[168487.543576] sd 0:0:0:0: [sda] tag#17 CDB: Read(10) 28 00 01 04 68 b0
00 00 08 00
[168487.551206] blk_update_request: I/O error, dev sda, sector 17066160
[168787.813941] sd 0:0:0:0: [sda] tag#20 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[168787.822951] sd 0:0:0:0: [sda] tag#20 CDB: Read(10) 28 00 01 04 68 b0
00 00 08 00
[168787.830544] blk_update_request: I/O error, dev sda, sector 17066160

Eventually, I could not do anything on the system. Not even a
‘reboot’. I had to do a cold power cycle to bring things back.

Is there anything I can do about this, or should I trash the drive and start anew?

Thanks

29 thoughts on - Errors On An SSD Drive

  • If it’s a bad sector problem, you’d write to sector 17066160 and see if the drive complies or spits back a write error. It looks like a bad sector in that the same LBA is reported each time, but I’ve only ever seen this with both a read error and a UNC error. So I’m not sure it’s a bad sector.

    What is DID_BAD_TARGET?

    And what do you get for smartctl -x?
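
    For example, something along these lines (assuming the drive shows up as /dev/sda, as in the logs above):

    smartctl -x /dev/sda

    That dumps everything the drive reports, including the SMART error log and, on most SSDs, wear-related attributes.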

    Chris Murphy

  • To be honest, I’d not try a btrfs volume on a notebook SSD. I did that on a couple of systems and it corrupted pretty quickly. I’d stick with xfs/ext4
    if you manage to get the drive working again.

    Mark Haney

  • I have yet to see an SSD read/write error which wasn’t related to disk issues like a bad sector, but the controller might have an issue with the drive. To verify it you will need to burn some read/write IOPS on the drive, but if it’s under warranty then it’s better to verify it now than later.

    Eliezer

  • What file system are you using? SSD drives have different characteristics that need to be accommodated (including a relatively slow write process, which becomes obvious as soon as the buffer is full), and never, never put a swap partition on one; the high activity will wear it out rather quickly. Might also check the cables, often a problem, particularly if they are older SATA cables being run at a possibly higher than rated speed. In any case, reformatting it might not be a bad idea, and you can always use the command-line program badblocks to exercise and test it. Keep in mind the drive will invisibly remap any bad sectors if possible. If the reported size of the drive is smaller than it should be, the drive has run out of spare blocks and dying blocks are being removed from the storage pool with no replacements.
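
    For instance, a sketch assuming the drive is /dev/sda and nothing on it is mounted:

    badblocks -sv /dev/sda     # read-only scan, with progress
    badblocks -svn /dev/sda    # non-destructive read-write test (much slower)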



  • I know this is common doctrine, but is this still generally held true?

    For a well configured desktop that rarely needs to swap, I struggle to see the load on the SSD as being significant, and yet obviously the performance of an SSD would make it ideal for swap.

    Exercising an SSD?

    smartctl will give you sensible information on what the drive thinks of itself, and will give you actual figures on wear levelling and such like.
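
    For example (assuming the drive is /dev/sda; attribute names vary by vendor):

    smartctl -A /dev/sda
    smartctl -A /dev/sda | grep -iE 'wear|realloc|media'

    On most drives the reallocated sector count and the wear-level indicator are the attributes worth watching.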

    Coo, I’ve never seen a disk actually shrink due to failed sectors. I don’t think I’ve got an SSD into a worn state yet to see this.

    jh

  • I have no experience on how to force a write to a specific sector and not cause other problems. I suspect that this sector is in the / partition:

    Disk /dev/sda: 240.1 GB, 240057409536 bytes, 468862128 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk label type: dos
    Disk identifier: 0x0000c89d

    Device Boot Start End Blocks Id System
    /dev/sda1 2048 2099199 1048576 83 Linux
    /dev/sda2 2099200 4196351 1048576 82 Linux swap / Solaris
    /dev/sda3 4196352 468862127 232332888 83 Linux

    But I don’t know where it is in relation to the way the drive was formatted in my notebook. I think it would have been in the / partition.

    About 17KB of output? I don’t know how to read what it is saying, but I noted this near the beginning:

    Write SCT (Get) XXX Error Recovery Control Command failed: scsi error badly formed scsi parameters

    Don’t know what this means…

    BTW, the system is a Cubieboard2 armv7 SoC running CentOS7-armv7hl. This is the first time I have used an SSD on a Cubie, but I know it is frequently done. I would have to ask on the Cubie forum what others’ experience with SSDs has been.

  • This is a CentOS7-armv7hl install, which is done by dd’ing the provided image onto a drive, so I really can’t alter the provided file systems much other than to resize them. What I have is:

    Model: ATA KINGSTON SV300S3 (scsi)
    Disk /dev/sda: 240GB
    Sector size (logical/physical): 512B/512B
    Partition Table: msdos
    Disk Flags:

    Number Start End Size Type File system Flags
    1 1049kB 1075MB 1074MB primary ext3
    2 1075MB 2149MB 1074MB primary linux-swap(v1)
    3 2149MB 240GB 238GB primary ext4

  • Robert Moskowitz wrote:

    Here’s a thought: I’ve not done this, but could you use smartctl to check the drive?
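
    For instance, something like:

    smartctl -H /dev/sda    # overall health verdict
    smartctl -a /dev/sda    # full attributes and error log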

    mark

  • Mark Haney wrote:

    That was merely to see if a trim operation on the whole device would bring some improvement.
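
    A whole-filesystem trim can be done with something like this (assuming the filesystem on the SSD is mounted at /):

    fstrim -v /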

    I have the system on SSDs at home and data on spinning disks, so far no problems with btrfs. Do I need to worry now?

  • “if you manage to get the drive working again.” Sounds like a hardware problem. Btrfs is explicitly optimized for SSD; the maintainers worked for FusionIO for several years of its development. If the drive is silently corrupting data, Btrfs will pretty much immediately start complaining where other filesystems will continue on. Bad RAM can also result in scary warnings that you don’t get with other filesystems. And I’ve been using it on numerous SSDs for years, and NVMe for a year, with zero problems.

    On CentOS though, I’d get newer btrfs-progs RPM from Fedora, and use either an elrepo.org kernel, a Fedora kernel, or build my own latest long-term from kernel.org. There’s just too much development that’s happened since the tree found in RHEL/CentOS kernels.

    Also, FWIW, Red Hat is deprecating Btrfs, per the RHEL 7.4 announcement. Support will probably be removed in RHEL 8. I have no idea how it’ll affect CentOS kernels, though. It will remain in Fedora kernels.

    Anyway, blkdiscard can be used on an SSD, whole device or a single partition, to zero them out. And at least recent ext4 and XFS mkfs will do a blkdiscard, same as mkfs.btrfs.
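
    For example (destructive, so only on something you’re about to re-create):

    blkdiscard /dev/sda3    # discard every block in the partition
    blkdiscard /dev/sda     # or the whole device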

    Chris Murphy


  • LBA 17066160 would be on sda3.
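
    (From the fdisk output above: 17066160 falls between sda3’s first sector, 4196352, and its last, 468862127. Assuming the ext4 filesystem uses the usual 4 KiB blocks, that is filesystem block (17066160 - 4196352) / 8 = 1608726.)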

    dd if=/dev/sda skip=17066160 count=1 2>/dev/null | hexdump -C

    That’ll read that sector and display hex and ascii. If you recognize the contents, it’s probably user data. Otherwise, it’s file system metadata or a system binary.

    If you get nothing but an I/O error, then it’s lost so it doesn’t matter what it is, you can definitely overwrite it.

    dd if=/dev/zero of=/dev/sda seek=17066160 count=1

    If you want an extra confirmation, you can first do ‘smartctl -t long /dev/sda’ and then, after the prescribed testing time (which is listed), check it again with ‘smartctl -a /dev/sda’ and see if the test completed, or if, under the self-test log section, it shows it was aborted and lists a number under the LBA_of_first_error column.

    Can you attach it as a file to the list? If the list won’t accept the attachment, put it up on fpaste.org or pastebin or something like that. MUAs tend to nerf the output, so don’t paste it into an email.

    It’s very common. I think this is just an ordinary bad sector, if that LBA value is consistent. If it’s a new SSD it’s slightly concerning. You can either keep an eye on it, or put a little pressure on the manufacturer or place of purchase, pointing out that you have a bad sector and would like to swap out the unit.

    SSDs, and in particular SD cards (which you’re not using; those show up as /dev/mmcblk0…), store your data as a probabilistic representation, and through a lot of magic the probability of retrieving your data correctly from an SSD is made very high. Almost deterministic.

    The magic is in the firmware, and so there’s some possibility any given SSD problem is related to a firmware bug. So it’s worth comparing the firmware reported by smartctl with what the manufacturer has, and then their changelog. Most have a way to update firmware without Windows, but don’t have images that will boot an ARM board; usually the “universal” updater is based on FreeDOS, funnily enough. You’d need to stick the SSD in an x86 computer to do this. Hilariously perverse, but I did this with a Samsung 830 SSD a while back: stuck it into a MacBook Pro, burned the firmware ISO onto a DVD-RW, and it booted that Mac (using the firmware’s BIOS compatibility support module) and updated the SSD’s firmware without a problem.

    Chris Murphy

  • Is that because the drive is compressing the information? Is there a way to turn this off? I hate mandatory compression, as losing one bit in a compressed file tends to be a big deal compared to the same in an uncompressed file.


  • I agree.

    It’s a bad idea to do without swap even if you almost never use it, because today’s bloated apps often have many pages of virtual memory they rarely or never actually touch. You want those pages to get swapped out quickly so that the precious RAM can be used more productively; by the buffer cache, if nothing else.

    I once used a web application server on a headless VPS that still had GUI libraries linked to its binary because one of the underlying technologies it uses was also used in a GUI app, and it was too difficult to tear all that GUI code out, even if it was never called. Because the VPS technology didn’t support swap, I directly paid the price for those megs of unused (and unusable!) libraries in my monthly VPS rental fees.

    Me, neither. I’m pretty sure the spare sector pool’s size isn’t reported to the OS, and the drive isn’t allowed to dip into the sectors it does expose externally for spares.

    When the spare pool is used up, the drive just starts failing in a way that even SMART can see.

  • Most modern virtual-memory OSes don’t swap out unused pages; instead, they swap IN accessed pages directly from the executable file. The only things written to swap are ‘dirty’ pages that have been changed since loading.


    john r pierce, recycling bits in santa cruz

  • No. I believe by “probabilistic representation” the parent poster simply means that in any given data cell, you don’t have a hard “1” or “0”, you have some voltage potential which can be interpreted as some number of 1 or 0 bits, often 3 bits or more.

    Between that fact and wear-leveling, you can’t take a simple voltage measurement on a data cell and say, “This cell contains 011.” You need more smarts about what’s going on to turn the voltage reading into the correct value.

    As the drive’s data cells wear out, the drive’s ability to do that correctly and reliably degrades. Thus cell death, thus drive death, thus filesystem death, thus backups, else sadness.

    And please don’t top-post.

    A: Yes.

    Q: Are you sure?

    A: Because it makes the flow of conversation more difficult to read.

    Q: Why shouldn’t I top-post?

  • Is that not a distinction without a difference in my case?

    Let’s say I have a system with 256 MB of free user-space RAM, and I have a binary that happens to be nearly 256 MB on disk, between the main executable and all the libraries it uses.

    Question: Can my program allocate any dynamic RAM?

    The OS’s VMM is free to use addresses beyond 0-256 MB, but since we’ve said there is no swap space, everything swapped in must still be assigned a place in physical RAM *somewhere*.

    Is there a meaningful distinction between:

    Scenario 1: The application’s first few executable pages are loaded from disk, a few key libraries are loaded, then the application does a dynamic memory allocation, then somehow causes all the rest of the executable pages to be loaded, running the system out of RAM.

    Scenario 2: The application is entirely loaded into RAM, nearly filling it, then the application attempts a large dynamic memory allocation, causing an OOM error.

    Regardless of the answer to these questions, I can tell you that switching that web site to a more efficient web application stack allowed us to shrink the VPS from a 256 MB plan, under which it would occasionally crash and require a reboot, to a 64 MB plan, under which the site has been rock-solid. Same VPS provider, same web site content, same user-facing functionality.

    If I’d had the ability to assign swap space, I probably could have gotten away with a 64 MB VPS plan with the inefficient web technology, too. They gave me plenty of disk space with that plan.

    (And no, swapon /some-file is no solution here. The VPS technology simply didn’t allow swap space, even from a swap file on one of the system disks. It wasn’t simply an inability to add a swap partition.)

  • Chris Murphy wrote:

    That’s one thing I’ve been wondering about: When using btrfs RAID, do you need to somehow monitor the disks to see if one has failed?
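
    For what it’s worth, the obvious candidates would seem to be polling the per-device error counters and running periodic scrubs, e.g. (assuming the volume is mounted at /mnt):

    btrfs device stats /mnt     # per-device read/write/corruption counters
    btrfs scrub start /mnt      # re-read everything and verify checksums
    btrfs scrub status /mnt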

    I can’t go with a more recent kernel version until NVIDIA has updated their drivers to no longer need fence.h (or whatever it was).

    And I thought stuff gets backported, especially things as important as file systems.

    That would suck badly, to the point at which I’d have to look for yet another distribution. The only one remaining is Arch.

    What do they suggest as a replacement? The only other FS that comes close is ZFS, and removing btrfs altogether would be taking living in the past too many steps too far.

  • You really don’t want to do that without first finding out what file is using that block. You will convert a detected I/O error into silent corruption of that file, and that is a much worse situation.

  • Yeah, he’d want to do an fsck -f and see if repairs are made, and also rpm -Va. There *will* be legitimately modified files, so it’s going to be tedious to sort out the ones that are legitimately modified vs. corrupt. If it’s a configuration file, I’d say you could ignore it, but any binary that’s modified in anything other than permissions needs to be replaced, and is the likely culprit.

    The smartmontools page has hints on how to figure out what file is affected by a particular sector being corrupt, but the more layers are involved, the more difficult that gets. I’m not sure there’s an easy way to do this with LVM in between the physical device and the file system.
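
    For a plain ext filesystem with no LVM in the way, that technique boils down to something like this (using the block number worked out above, 1608726, on /dev/sda3):

    debugfs -R "icheck 1608726" /dev/sda3    # which inode owns that block
    debugfs -R "ncheck <inode>" /dev/sda3    # which path that inode belongs to (substitute the inode icheck printed)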

  • fsck checks filesystem metadata, not the content of files. It is not going to detect that a file has had 512 bytes replaced by zeros. If the file is a non-configuration file installed from an RPM, then “rpm -Va” should flag it.

    LVM certainly makes the procedure harder. Figuring out what filesystem block corresponds to that LBA is still possible, but you have to examine the LV layout in /etc/lvm/backup/ and learn more than you probably wanted to know about LVM.

  • Chris might have been thinking of fsck -c or -k, which do various sorts of badblocks scans.

    That’s still a poor alternative to strong data checksumming and Merkle tree structured filesystems, of course.

    Linux kernel 4.8 added a feature called reverse mapping which is intended to solve this problem.

    In principle, this will let you get a list of files that are known to be corrupted due to errors at the block layer, then fix it by removing or overwriting those files. The block layer, DM, LVM2, and filesystem layers will then be able to understand that those blocks are no longer corrupt, therefore the filesystem is fine, as are all the possible layers in between.
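
    If memory serves, the XFS incarnation of this is the reverse-mapping btree, which is optional and has to be chosen at mkfs time, something like:

    mkfs.xfs -m rmapbt=1 /dev/sdXN    # /dev/sdXN being whatever partition you're (re)creating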

    This understanding is based on a question I asked and had answered on the Stratis-Docs GitHub issue tracker:

    https://github.com/stratis-storage/stratis-docs/issues/53

    We’ll see how well it works in practice. It is certainly possible in principle: ZFS does this today.

  • Robert Nichols wrote:

    I posted a link yesterday – let me know if you want me to repost it – to someone’s web page who REALLY knows about filesystems and sectors, and how to identify the file that a bad sector is part of.

    And it works. I haven’t needed it in a few years, but I have followed his directions, and identified the file on the bad sector.

    mark