is there anything that speaks against putting a cyrus mail spool onto a btrfs subvolume?
I wouldn’t use btrfs for anything, much less a mail spool. I used it in production on DB and Web servers and fought corruption issues and scrubs hanging the system more times than I can count. (This was within the last 24 months.) I was told by certain mailing lists that btrfs isn’t considered production level. So, I scrapped the lot, went to xfs and haven’t had a problem since.
I’m not sure why you’d want your mail spool on a filesystem that seems to hate being hammered with reads/writes. Personally, on all my mail spools, I use XFS or EXT4. Our servers here handle 600 million messages a month without trouble on those filesystems.
Just my $0.02.
Mark Haney Network Engineer at NeoNova
919-460-3330 option 1
Mark Haney wrote:
Btrfs appears rather useful because the disks are SSDs, because it allows me to create subvolumes and because it handles SSDs nicely. Unfortunately, the SSDs are not suited for hardware RAID.
The only alternative I know is xfs or ext4 on mdadm and no subvolumes, and md RAID has severe performance penalties which I´m not willing to afford.
Part of the data I plan to store on these SSDs greatly benefits from the low latency, making things about 20–30 times faster for an important application.
So what should I do?
What kind of storage solutions do people use for cyrus mail spools? Apparently you can not use remote storage, at least not NFS. That even makes it difficult to use a VM due to limitations of available disk space.
I´m reluctant to use btrfs, but there doesn´t seem to be any reasonable alternative.
I think it depends on who you ask. Facebook and Netflix are using it extensively in production:
Though they have the in-house kernel engineering resources to troubleshoot problems. When I see quotes like this on the product’s wiki:
“The parity RAID code has multiple serious data-loss bugs in it. It should not be used for anything other than testing purposes.”
I’m reluctant to store anything of value on it. Have you considered using ZoL? I’ve been using it for quite some time and haven’t lost data.
– Ryan http://prefetch.net
I hate top posting, but since you’ve got two items I want to comment on, I’ll suck it up for now.
Having SSDs alone will give you great performance regardless of filesystem. BTRFS isn’t going to impact I/O any more significantly than, say, XFS. It does have serious stability/data integrity issues that XFS doesn’t have. There’s no reason not to use SSDs for storage of immediate data and mechanical drives for archival data storage.
As for VMs we run a huge Zimbra cluster in VMs on VPC with large primary SSD volumes and even larger (and slower) secondary volumes for archived mail. It’s all CentOS 6 and works very well. We process 600 million emails a month on that virtual cluster. All EXT4 inside LVM.
I can’t tell you what to do, but it seems to me you’re viewing your setup from a narrow SSD/BTRFS standpoint. Lots of ways to skin that cat.
Mark Haney Network Engineer at NeoNova
It´s RAID1, not 5/6. It´s only 2 SSDs.
I do not /need/ to put the mail spool there, but it makes sense because the data that benefits from the low latency fills only about 5% of them, and the spool is mostly read, resulting in not so much wear of the SSDs.
I can probably do a test with that data on the hardware RAID, and if performance is comparable, I´d rather put it there than on the SSDs.
Yes, and I´m moving away from ZFS because it remains alien, and the performance is poor. ZFS wasn´t designed with performance in mind, and that shows.
It is amazing that SSDs with Linux are still so pointless and that there is no file system available actually suited for production use providing features ZFS and btrfs are valued for. It´s even frustrating that disk access still continues to defeat performance so much.
Maybe it´s crazy wanting to put data onto SSDs with btrfs because the hardware RAID is also RAID1, for performance and better resistance against failures than RAID5 has. I guess I really shouldn´t do that.
Now I´m looking forward to the test with the hardware RAID. A RAID1 of 8 disks may yield even better performance than 2 SSDs in software RAID1 with btrfs.
I do, too, yet sometimes it´s reasonable. I also hate it when the lines are too long :)
It depends, i. e. I can´t tell how these SSDs would behave if large amounts of data would be written and/or read to/from them over extended periods of time because I haven´t tested that. That isn´t the application, anyway.
But mdadm does, the impact is severe. I know there are ppl saying otherwise, but I´ve seen the impact myself, and I definitely don´t want it on that particular server because it would likely interfere with other services. I don´t know if the software RAID of btrfs is better in that or not, though, but I´m seeing btrfs on SSDs being fast, and testing with the particular application has shown a speedup of factor 20–30.
That is the crucial improvement. If the hardware RAID delivers that, I´ll use that and probably remove the SSDs from the machine as it wouldn´t even make sense to put temporary data onto them because that would involve software RAID.
Do you use hardware RAID with SSDs?
That´s because I do not store data on a single disk, without redundancy, and the SSDs I have are not suitable for hardware RAID. So what else is there but either md-RAID or btrfs when I do not want to use ZFS? I also do not want to use md-RAID, hence only btrfs remains. I also like to use sub-volumes, though that isn´t a requirement (because I can use directories instead and lose the ability to make snapshots).
I stay away from LVM because that just sucks. It wouldn´t even have any advantage in this case.
I haven’t really been following this thread, but if your requirements are that heavy, you’re past the point that you need to spring some money and buy hardware RAID cards, like LSI, er, Avago, I mean, who’s bought them more recently?
Heavy requirements are not required for the impact of md-RAID to be noticeable.
Hardware RAID is already in place, but the SSDs are “extra” and, as I said, not suited to be used with hardware RAID.
It remains to be tested how the hardware RAID performs, which may be even better than the SSDs.
If your I/O is going to be heavy (and you’ve not mentioned expected traffic, so we can only go on what little we glean from your posts), then SSDs will likely start having issues sooner than a mechanical drive might. (Though, YMMV.) As I’ve said, we process 600 million messages a month, on primary SSDs in a VMware cluster, with mechanical storage for older, archived user mail. Archived may not be exactly correct, but the context should be clear.
I never said anything about MD RAID. I trust that about as far as I could throw it. And having had 5 surgeries on my throwing shoulder, that wouldn’t be far. Again, if the idea is to have fast primary storage, there are pretty large SSDs available now and I’ve hardware RAIDed SSDs before without trouble, though not for any heavy lifting; it’s my test servers at home. Without an idea of the expected mail traffic, this is all speculation. We do not here where I work, but that was set up LONG before I arrived.
If the SSDs you have aren’t suitable for hardware RAID, then they aren’t good for production level mail spools, IMHO. I mean, you’re talking like you’re expecting a metric buttload of mail traffic, so it stands to reason you’ll need really beefy hardware. I don’t think you can do what you seem to need on budget hardware. Personally, and solely based on this thread alone, if I was building this in-house, I’d get a decent server cluster together and build a FC or iSCSI SAN to a Nimble storage array with Flash/SSD front ends and large HDDs in the back end. This solves virtually all your problems. The servers will have tiny SSD boot drives (which I prefer over booting from the SAN) and then everything else gets handled by the storage back-end.
In effect this is how our mail servers are set up here. And they are virtual. LVM is a joke. It’s always been something I’ve avoided like the plague.
Could someone, please, elaborate on the statement that “SSDs are not suitable for hardware RAID”?
Valeri Galtsev Sr System Administrator Department of Astronomy and Astrophysics Kavli Institute for Cosmological Physics University of Chicago Phone: 773-702-4247
It will depend on the type of SSD and the type of hardware RAID. There are at least 4 different classes of SSD drives with different levels of cache, write/read performance, number of lifetime writes, etc. There are also multiple types of hardware RAID. A lot of hardware RAID will try to even out disk usage in different ways. This means ‘moving’ the heavily used data from slow parts to fast parts etc etc. On an SSD all these extra writes aren’t needed and so if the hardware RAID doesn’t know about SSD technology it will wear out the SSD quickly. Other hardware RAID parts that can cause faster failures on SSDs are where it does test writes all the time to see if disks are bad etc. Again if you have gone with commodity SSDs this will wear out the drive faster than expected and boom bad disks.
That said, some hardware RAIDs are supposedly made to work with SSD drive technology. They don’t do those extra writes, they also assume that the disks underneath will read/write in near constant time so queueing of data is done differently. However that stuff costs extra money and is not usually shipped in standard OEM hardware.
Stephen J Smoogen.
Wow, you learn something every day ;-) Which hardware RAIDs do this moving of data (manufacturer/model, please – believe it or not I never heard of that ;-). And what are the “slow part” and “fast part” that data are being moved between?
Thanks in advance for tutorial!
I thought it was HP who had these, but I can’t find it.. which means without references… I get an F. My apologies on that. Thank you for keeping me honest.
I/O is not heavy in that sense, that´s why I said that´s not the application. There is I/O which, as tests have shown, benefits greatly from low latency, which is where the idea to use SSDs for the relevant data has arisen from. This I/O only involves a small amount of data and is not sustained over long periods of time. What exactly the problem is with the application being slow with spinning disks is unknown because I don´t have the sources, and the maker of the application refuses to deal with the problem entirely.
Since the data requiring low latency will occupy about 5% of the available space on the SSDs and since they are large enough to hold the mail spool for about 10 years at its current rate of growth besides that data, these SSDs could be well used to hold that mail spool.
How else would I create a RAID with these SSDs?
I´ve been using md-RAID for years, and it always worked fine.
The SSDs don´t need to be large, and they aren´t. They are already greatly oversized at 512GB nominal capacity.
There´s only a few hundred emails per day. There is no special requirement for their storage, but there is a lot of free space on these SSDs, and since the email traffic is mostly read-only, it won´t wear out the SSDs. It simply would make sense to put the mail spool onto these SSDs.
Probably with the very expensive SSDs suited for this …
If SSDs not suitable for RAID usage aren´t suitable for production use, then basically all SSDs not suitable for RAID usage are SSDs that can´t be used for anything that requires something less volatile than a ramdisk. Experience with such SSDs contradicts this so far.
There is no “storage backend” but a file server, which, instead of 99.95% idling, is being assigned additional tasks, and since it is difficult to put a cyrus mail spool on remote storage, the email server is one of these tasks.
You have entirely different requirements.
I´ve also avoided it until I had an application where it would have been advantageous if it actually provided the benefits it seems supposed to provide. It turned out that it didn´t and only made things much worse, and I continue to stay away from it.
After all, you´re saying it´s a bad idea to use these SSDs, especially with btrfs. I don´t feel good about it, either, and I´ll try to avoid using them.
Mark Haney wrote:
a Dell server: Dell, at least, offers two kinds of SSDs, one for heavy write, I think it was, and one for equal r/w. You might dig into that.
Odd, we’ve never seen anything like that. Of course, we’re not handling the kind of mail you are… but serious scientific computing hits storage hard, also.
Why? We have it all over, and have never seen a problem with it. Nor have I, personally, as I have a RAID 1 at home.
Valeri Galtsev wrote:
When you search for it, you´ll find that besides wearing out undesirably fast, which apparently can be attributed mostly to less overcommitment of the drive, you may also experience degraded performance over time which can be worse than you would get with spinning disks, or at least not much better.
Add to that the firmware being designed for an entirely different application and having bugs, and your experiences with surprisingly incompatible hardware, and you can imagine that using an SSD not designed for hardware RAID applications with hardware RAID is a bad idea. There is a difference like night and day between “consumer hardware” and hardware you can actually use, and that is not only the price you pay for it.
That’s a biggie: are these SSDs consumer grade, or enterprise grade? It was common knowledge 8-9 years ago that you *never* want consumer grade in anything that mattered, other than maybe a home PC – they wear out much sooner.
But then, you can’t really use consumer grade h/ds in a server. We like the NAS-rated ones, like WD Red, which are about 1.33x the price of consumer grade, and solid… and a lot less than the enterprise-grade, which are about 3x consumer grade.
Make a test and replace a software RAID5 with a hardware RAID5. Even with only 4 disks, you will see an overall performance gain. I´m guessing that the SATA controllers they put onto the mainboards are not designed to handle all the data — which gets multiplied to all the disks — and that the PCI bus might get clogged. There´s also the CPU being burdened with the calculations required for the RAID, and that may not be displayed by tools like top, so you can be fooled easily.
Graphics cards have acceleration in hardware for a reason. What was the last time you tried to do software rendering, and what frame rates did you get? :)
Offloading the I/O to a designated controller gives you room for the things you actually want to do, similar to a graphics card.
See, this is the kind of information that would have made this thread far shorter. (Maybe.) The one thing that you didn’t explain is whether this application is the one /using/ the mail spool or if you’re adding Cyrus to that system to be a mail server.
Possibly, but that’s somewhat irrelevant. I’ve taken off-the-shelf SSDs and hardware RAID’d them. If they work for the hell I put them through (processing weather data), they’ll work for the type of service you’re saying you have.
Not true at all. Maybe 5 years ago SSDs were hit or miss with hardware RAID. Not anymore. It’s just another drive to the system; the controllers don’t know the difference between a SATA HDD and a SATA SSD. Couple that with the low volume of mail, and you should be fine for HW RAID.
Again, you never mentioned the volume of mail expected, and your previous threads seemed to indicate you were expecting enough to cause issues with SSDs and BTRFS. In IT when we get a ‘my printer is broken’, we ask for more info since that’s not descriptive enough. If this server is as asleep as you (now) make it sound, BTRFS might be fine. Though, personally, I’d avoid it regardless.
I know that now. Previously, you made it sound like your mail flow would be a lot closer to ‘heavy’ than what you’ve finally described. I can only offer thoughts based on what information I’m given.
worth using in any server. The SSD question, prompted by you, was whether the SSDs could:
1) be hardware RAID’d
2) handle the load of mail you were expecting.
512GB SSDs are new enough to probably be HW RAID’d fine, assuming they aren’t weird ones from a third party no one has really heard of. I know because my last company bought some inexpensive (I call them knockoffs) third-party SSDs that were utter crap from the moment an OS was installed on them. If yours are from Seagate, WD, or another big-name drive maker, I would be surprised if they choked being on a hardware RAID card. A setup like yours doesn’t appear to need ‘Enterprise’ level hardware; SMB hardware would appear to work for you just as well.
Just not with BTRFS. On any drive. Ever.
Mark Haney wrote:
Actually, with the usage you’re talking about, I’m surprised you’re using SATA and not SAS.
This is what Red Hat says about btrfs:
The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux.
The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature.
Thanks. That seems to clear the fog a little bit. I still would like to hear manufacturers/models here. My choices would be: Areca or LSI (bought out by Intel, so former LSI chipset and microcode/firmware) and as SSD Samsung Evo SATA III. Can anyone who has used these in hardware RAID describe any bad experience?
I am kind of shying away from “crap” hardware which in the long run is more expensive, even though it looks cheaper (Pricegrabber is your enemy – I would normally say to my users). So, I never would consider using poorly/cheaply designed hardware in some setup (e.g. hardware RAID based storage) one expects performance from. Am I still taking a chance of hitting a “bad” hardware RAID + SSD combination? Just curious where we actually stand.
Thanks again for fruitful discussion!
Does the Samsung EVO have supercaps and write-back buffer protection?
if not, it is in NO way suitable for reliable use in a raid/server environment.
as far as raiding SSDs go, the ONLY raid I’d use with them is raid1 mirroring (or if more than 2, raid10 striped mirrors). And I’d probably do it with OS based software raid, as that’s more likely to support SSD trim than a hardware raid card, plus allows the host to monitor the SSDs via SMART, which a hardware raid card probably hides.
I’d also make sure I undercommit the size of the SSD, so if it’s a 500GB SSD, I’d make absolutely sure to never have more than 300-350GB of data on it. if it’s part of a stripe set, the only way to ensure this is to partition it so the raid slice is only 300-350GB.
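To put numbers on the undercommit advice, here is a minimal Python sketch. The 500GB drive and the 300-350GB slices are the figures from the message above; the function name is made up for illustration, and this is only arithmetic, not a statement about what any particular SSD firmware does with the unused space.

```python
def overprovisioning(nominal_gb, partition_gb):
    """Return (spare GB, spare fraction) left unused on the drive.

    nominal_gb:   advertised capacity of the SSD
    partition_gb: size of the RAID slice / partition actually used
    """
    spare_gb = nominal_gb - partition_gb
    return spare_gb, spare_gb / nominal_gb


# Figures from the message: a 500GB SSD limited to 300-350GB of data.
for partition_gb in (300, 350):
    spare, fraction = overprovisioning(500, partition_gb)
    print(f"{partition_gb} GB slice: {spare} GB spare ({fraction:.0%} of the drive)")
```

So the suggested range leaves roughly a third of the drive unallocated.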
john r pierce, recycling bits in santa cruz
Intel only purchased the networking component of LSI, Axxia, from Avago. The RAID division was merged into Broadcom (post Avago merger).
That sounds like a whole lot of guesswork, which I’d suggest should inspire slightly less confidence than you are showing in it.
RAID parity calculations are accounted under a process named md_raid. You will see time consumed by that code under all of the normal process accounting tools, including total time under “ps” and current time under “top”. Typically, your CPU is vastly faster than the cheap processors on hardware RAID controllers, and the advantage will go to software RAID over hardware. If your system is CPU bound, however, and you need that extra fraction of a percent of CPU cycles that go to calculating parity, hardware might offer an advantage.
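A rough sketch of how one might pull the md kernel threads out of process accounting output. It assumes the threads show up with names like md0_raid1 or md1_raid6 (device name plus RAID personality), and the sample text below is made up for illustration, not captured from a real machine.

```python
import re

# Assumption: md kernel threads are named <md device>_<personality>,
# e.g. md0_raid1 or md127_raid6.
MD_THREAD = re.compile(r"^md\d+_raid\d+$")


def md_raid_cpu_times(ps_output):
    """Map md RAID thread names to their accumulated CPU time.

    Expects two whitespace-separated columns per line: command, time.
    """
    times = {}
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and MD_THREAD.match(parts[0]):
            times[parts[0]] = parts[1]
    return times


# Hypothetical sample in the shape of "ps -eo comm,time" output.
SAMPLE = """\
COMMAND             TIME
systemd         00:00:03
md0_raid1       00:12:41
md1_raid6       01:02:17
"""

print(md_raid_cpu_times(SAMPLE))
```

On a live system the same function could be fed the output of `ps -eo comm,time`.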
The last system I purchased had its storage controller on a PCIe 3.0 x16 port, so its throughput to the card should be around 16GB/s. Yours might be different. I should be able to put roughly 20 disks on that card before the PCIe bus is the bottleneck. If this were a RAID6 volume, a hardware RAID card would be able to support sustained writes to 22 drives vs 20 for md RAID. I don’t see that as a compelling advantage, but it is potentially an advantage for a hypothetical hardware RAID card.
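The 20 vs 22 figure can be reproduced with a little arithmetic. In this sketch the 16GB/s bus figure is from the message; the 800 MB/s per-drive rate is my assumption, picked only because it yields "roughly 20 disks", and the function names are invented for the example. The reasoning is that with md RAID the host pushes data and parity across the bus, while a hardware card receives only the data and generates the two RAID6 parity streams itself.

```python
def max_drives_md_raid(bus_mb_s, per_drive_mb_s):
    """Drives the bus can feed when the host sends data AND parity."""
    return bus_mb_s // per_drive_mb_s


def max_drives_hw_raid6(bus_mb_s, per_drive_mb_s, parity_drives=2):
    """Drives a hardware card can feed: the bus carries only data,
    the card adds the parity writes on its own side."""
    return bus_mb_s // per_drive_mb_s + parity_drives


BUS_MB_S = 16_000       # ~16GB/s for PCIe 3.0 x16, from the message
PER_DRIVE_MB_S = 800    # ASSUMPTION: sustained write rate per drive

print("md RAID6:      ", max_drives_md_raid(BUS_MB_S, PER_DRIVE_MB_S), "drives")
print("hardware RAID6:", max_drives_hw_raid6(BUS_MB_S, PER_DRIVE_MB_S), "drives")
```

With slower drives both numbers go up, but the difference stays at the two parity drives.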
When you are testing your 4 disk RAID5 array, microbenchmarks like bonnie++ will show you a very significant advantage toward the hardware RAID as very small writes are added to the battery-backed cache on the card and the OS considers them complete. However, on many cards, if the system writes data to the card faster than the card writes to disks, the cache will fill up, and at that point, the system performance can suddenly and unexpectedly plummet. I’ve run a few workloads where that happened, and we had to replace the system entirely, and use software RAID instead. Software RAID’s performance tends to be far more predictable as the workload increases.
Outside of microbenchmarks like bonnie++, software RAID often offers much better performance than hardware RAID controllers. Having tested systems extensively for many years, my advice is this: there is no simple answer to the question of whether software or hardware RAID is better. You need to test your specific application on your specific hardware to determine what configuration will work best. There are some workloads where a hardware controller will offer better write performance, since a battery backed write-cache can complete very small random writes very quickly. If that is not the specific behavior of your application, software RAID will very often offer you better performance, as well as other advantages. On the other hand, software RAID absolutely requires a monitored UPS and tested auto-shutdown in order to be remotely reliable, just as a hardware RAID controller requires a battery backed write-cache, and monitoring of the battery state.
With all due respect, John, this is the same as hard drive cache is not backed up power wise for a case of power loss. And hard drives all lie about write operation completed before data actually are on the platters. So we can claim the same: hard drives are not suitable for RAID. I implied to find out from experts in what respect they claim SSDs are unsuitable for hardware RAID as opposed to mechanical hard drives.
Am I missing something?
Good, thanks. My 3ware RAIDs through their 3dm daemon do warn me about SMART status: fail (meaning the drive though working should according to SMART be replaced ASAP). Not certain off hand about LSI ones (one should be able to query them through command line client utility).
Great point! And one may want to adjust the stripe size to match the SSD’s internals, as the default is for hard drives, right?
Thanks, John, that was instructive!
major difference is, SSD’s do a LOT more write buffering as their internal write blocks are on the order of a few 100KB, also they extensively reorder data on the media, both for wear leveling and to minimize physical block writes so there’s really no way the host and/or controller can track whats going on.
enterprise hard disks do NOT do hidden write buffering, it’s all fully manageable via SAS or SATA commands. desktop drives tend to lie about it to achieve better performance. I do NOT use desktop drives in raids.
as the SSD physical data blocks have no visible relation to logical block numbers or CHS, its not practical to do this. I’d use a fairly large stripe size, like 1MB, so more data can be sequentially written to the same device (even tho the device will scramble it all over as it sees fit).
John R Pierce wrote:
Isn´t it easier for SSDs to write small chunks of data at a time?
The small chunk might fit into some free space more easily than a large one which needs to be spread out all over the place.
They are not specifically enterprise rated and especially not for use with hardware RAID.
Similar SSDs are in use in a server for about 2 years now as cache for ZFS, and there haven´t been any issues with them.
Those are pretty worthwhile, though not the fastest. Out of 14, one has failed over the last 3 years or so, and it was still under warranty. They do serve their purpose.
SSDs read/write in large-ish (256k-4M) blocks/pages. Seems to me that drive blocks and hardware RAID strip size and file system block/cluster/extents sizes and etc and etc and etc should be aligned for best performance.
Specifically the section:
NAND-flash pages and blocks
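As a sketch of what "aligned" means here: every layer should start on, and be a multiple of, the flash block size below it. The sizes used are illustrative assumptions (the 256k-4M range is from the message above, and the real flash geometry of a given SSD is usually not published), and the helper names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def is_aligned(offset_bytes, block_bytes):
    """True if offset_bytes falls exactly on a block boundary."""
    return offset_bytes % block_bytes == 0


def layout_aligned(partition_start_bytes, stripe_bytes, flash_block_bytes):
    """Partition must start on a flash block boundary and the RAID
    stripe unit must be a whole number of flash blocks."""
    return (is_aligned(partition_start_bytes, flash_block_bytes)
            and is_aligned(stripe_bytes, flash_block_bytes))


# Partition starting at sector 2048 (512-byte sectors) = 1 MiB, 1 MiB stripe.
print(layout_aligned(2048 * 512, 1 * MIB, 256 * KIB))  # 256 KiB flash block
print(layout_aligned(2048 * 512, 1 * MIB, 4 * MIB))    # 4 MiB flash block
# Legacy partition start at sector 63.
print(layout_aligned(63 * 512, 1 * MIB, 256 * KIB))
```

The second case shows why a guess at the flash block size matters: a 1 MiB start is fine for 256 KiB blocks but not for 4 MiB ones.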
It was a simple question to begin with; I only wanted to know if something speaks against using btrfs for a cyrus mail spool. There are things that speak against doing that with NFS, so there might be things with btrfs.
The application doesn´t use the mail spool at all, it has its own dataset.
Well, I can´t very well test them with the mail spool, so I´ve been going with what I´ve been reading about SSDs with hardware RAID.
I´d need another controller to do hardware RAID, which would require another slot on board, and IIRC, there isn´t a suitable one free anymore. Or I´d have to replace two of the other disks with the SSDs, and that won´t be a good thing to do.
Of course — the issue, or question, is btrfs, not the SSDs.
Yes, I´m the one saying not to use them. My question was if there´s anything that speaks against using btrfs for a cyrus mail spool. It wasn´t about SSDs.
Hardware RAID for the SSDs is not really an option because the ports of the controllers are used otherwise, and it is unknown how well these SSDs would work with them. Otherwise I
wouldn´t consider using btrfs.
Well, that´s a problem because when you don´t want md-RAID and can´t do hardware RAID, the only other option is ZFS, which I don´t want either. That leaves me with not using the SSDs at all.
It depends on your budget and on the hardware you plan to use the controller with, and on what you´re intending to do. I wouldn´t recommend using SSDs that are not explicitly rated for use with hardware RAID with hardware RAID.
Samsung seems to have firmware bugs that make the kernel/btrfs disable some features. I´d go with Intel SSDs and either use md-RAID or btrfs, but the reliability of btrfs is questionable, and md-RAID has a performance penalty.
It really depends on the RAID-controller and the SSDs. Every RAID-controller has a maximum number of IOPS it can process.
Also, as pointed out, consumer SSD have various deficiencies that make them unsuitable for enterprise-use:
https://blogs.technet.microsoft.com/filecab/2016/11/18/dont-do-it-consumer-ssd/
Enterprise SSDs also fail much more predictably. You basically get an SLA with them about the DWPD/TBW data.
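For reference, the two endurance figures convert into each other: TBW (terabytes written) is DWPD (drive writes per day) times capacity times the days in the warranty period. A minimal sketch follows; the 512GB capacity matches the drives discussed in this thread, while the 1 DWPD rating and 5-year warranty are made-up example values, not the specification of any real drive.

```python
def tbw_from_dwpd(dwpd, capacity_gb, warranty_years):
    """Terabytes written over the warranty period for a given DWPD rating."""
    return dwpd * (capacity_gb / 1000) * 365 * warranty_years


def dwpd_from_tbw(tbw, capacity_gb, warranty_years):
    """Drive writes per day implied by a TBW rating."""
    return tbw / ((capacity_gb / 1000) * 365 * warranty_years)


# Hypothetical rating: 512GB drive, 1 DWPD, 5-year warranty.
tbw = tbw_from_dwpd(1, 512, 5)
print(f"{tbw:.1f} TB written over the warranty period")
print(f"{dwpd_from_tbw(tbw, 512, 5):.2f} drive writes per day")
```

Comparing the daily write volume you actually measure against the DWPD figure gives a rough idea of whether a drive is being worked inside its rating.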
For small amounts of highly volatile data, I recommend looking into Optane SSDs.
As for BTRFS: Red Hat dumped it. So, it’s a SuSE/Ubuntu thing right now. Make of that what you want ;-)
Personally, I’d prefer to use ZFS for SSDs. No Hardware-RAID for sure. Not sure if I’d use it on anything else but FreeBSD (even though a Linux port is available and code-wise it’s more or less the same).
From personal experience, it’s better to even ditch the non-RAID HBA and just go with NVMe SSDs for the 2.5" drive slots (a.k.a. SFF-8639, a.k.a. U.2 form factor). If you have spare PCIe slots, you can also go for HHHL PCIe NVMe cards – but of course, you’d have to RAID them.
Gordon Messmer wrote:
It´s called “experience”. I haven´t tested a great number of machines extensively to experience the difference between software and hardware on them, and I agree with what you´re saying. It´s all theory until it has been suitably tested, hence my recommendation to test it.
Johnny Hughes wrote:
That surely speaks against it.
However, it´s hard to believe. They must be expecting btrfs never to become useable.
the SSD collects data blocks being written and when a full flash block worth of data is collected, often 256K to several MB, it writes them all at once to a single contiguous block on the flash array, no matter what the ‘address’ of the blocks being written is. think of it as a
different drive brands and models use different strategies for this, and all this is completely opaque to the host OS so you really can’t outguess or manage this process at the OS or disk controller level.
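A toy model of the buffering described above, in Python. It is not how any real SSD firmware works (that is opaque, as noted); it only illustrates that scattered logical writes get collected and flushed as one flash-block-sized write regardless of their addresses. The class name and the 256 KiB / 4 KiB sizes are assumptions chosen for the example.

```python
class CoalescingBuffer:
    """Collect logical block writes; flush one flash block at a time."""

    def __init__(self, flash_block_bytes=256 * 1024, logical_block_bytes=4096):
        # Number of logical blocks that fit in one flash block.
        self.blocks_per_flash_block = flash_block_bytes // logical_block_bytes
        self.pending = []        # logical addresses waiting in the buffer
        self.flash_writes = []   # each entry: addresses written together

    def write(self, logical_address):
        self.pending.append(logical_address)
        if len(self.pending) == self.blocks_per_flash_block:
            # A full flash block of data is written in one go, no matter
            # where the logical addresses point.
            self.flash_writes.append(self.pending)
            self.pending = []


ssd = CoalescingBuffer()
# 130 writes to scattered logical addresses.
for i in range(130):
    ssd.write((i * 7919) % 1_000_000)

print("flash block writes:", len(ssd.flash_writes))
print("still buffered:    ", len(ssd.pending))
```

The addresses inside one flushed block have nothing to do with each other, which is why the host cannot infer physical placement from logical block numbers.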