
Most HDDs are going to the server market, and from that point of view the capacity increase has been very, very slow.

Edit: It seems a lot of our computing components have reached a plateau, or will within this decade: HDD, DRAM, NAND, chip processes, etc. That is not to say they won't improve, but their unit costs aren't dropping any more.




And for the server market there's a sweet spot, for each workload, in how big you want your spinning disks to be. These disks aren't getting any faster, and rebuilding a 22 TB HDD is going to take a while, which imposes serious durability risks.
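
As a rough back-of-envelope (the 22 TB figure is from above; the ~200 MB/s sustained rate is an assumption, and real rebuilds under load are slower):

    # Back-of-envelope rebuild time for a 22 TB drive, assuming the rebuild
    # can stream at ~200 MB/s with no competing traffic (optimistic).
    capacity_bytes = 22e12
    rebuild_rate = 200e6          # bytes/second, assumed sustained rate
    hours = capacity_bytes / rebuild_rate / 3600
    print(f"best-case rebuild: ~{hours:.0f} hours")   # ~31 hours; days under real load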


It's not just the rebuild; some storage software (Ceph, for example) also validates the data on the disks from time to time, and since IOPS are quite limited on spindles, it takes more and more of that IOPS budget just to verify the data you already have on the disks.
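
Rough numbers on how much of a disk's time that verification eats (the scrub rate and interval here are assumptions, not Ceph defaults):

    # A 22 TB drive verified end-to-end once a week at an assumed 100 MB/s,
    # leaving the rest of the throughput for client I/O.
    capacity_bytes = 22e12
    scrub_rate = 100e6                      # bytes/second reserved for scrubbing (assumed)
    scrub_seconds = capacity_bytes / scrub_rate
    week_seconds = 7 * 24 * 3600
    print(f"scrub takes ~{scrub_seconds/3600:.0f} h, "
          f"~{100*scrub_seconds/week_seconds:.0f}% of the week")   # ~61 h, ~36%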


ZFS also periodically resilvers the disks to keep the FS in top shape. IIRC, ZFS tries to resilver disks when traffic is low, but it's not always possible.

I bet that disks in the 16+ TB range will be used for colder tiers of storage. Also, they should be useful as Lustre OSTs, since random read storms hit the MDT more severely.


A friendly minor correction: a resilver only happens when a drive is replaced; the periodic checksum check is a scrub in ZFS parlance.


And the scrub has to be initiated somehow, typically via a cron job; it's not automatic in ZFS.

Though NAS distributions like TrueNAS/FreeNAS set this up for you by default.
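
A typical cron entry just runs `zpool scrub <pool>` on a schedule. A minimal sketch of the same thing in Python, assuming a hypothetical pool named "tank" and sufficient privileges:

    # Minimal stand-in for the usual cron job: kick off a scrub and print
    # the pool status afterwards (the scrub itself runs in the background).
    import subprocess

    POOL = "tank"   # hypothetical pool name

    subprocess.run(["zpool", "scrub", POOL], check=True)
    status = subprocess.run(["zpool", "status", POOL],
                            capture_output=True, text=True, check=True)
    print(status.stdout)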


> And the scrub has to be initiated somehow, typically via a cron job; it's not automatic in ZFS.

The devices were Oracle/Sun ZFS appliances (I tortured a 7320, we liked it and bought a full-out 7420), so maybe it was set up to scrub/resilver automatically in some cases.

As I mentioned, we're more of a Lustre shop, and we retired those systems some years ago.


> A friendly minor correction...

Thank you. I'm no ZFS expert, TBH (we use Lustre much more), but IIRC, when I was benchmarking the then-new Sun/Oracle ZFS 7320, I remember it resilvering the disks after especially torturous loads, at night.

Maybe it was specific to the appliances (our behemoth 7420 did the same), or something was wrong. I remember the Oracle/Sun guys jokingly asking me whether I had succeeded in making it resilver the disks, and hearing that it had indeed resilvered them a dozen times visibly upset them. All they said was, "Pack it up, we need to go."

Fun times, it was.


Hmm, if you had to resilver the disks a dozen times, I would assume that's an indication of a hardware failure somewhere in the device (perhaps RAM?).


Yes, that's a good callout. I was being lazy, but there are quite a few different reasons to need to scan the entire disk.


Since all heads are mounted to the same arm, only one head can lock onto a track at a given time. What if each head had an independent micro-actuator, so that all heads could lock onto their tracks at the same radius, with the data distributed across all heads? Wouldn't this improve throughput n-fold?

Edit: It seems modern hard drives already have micro-actuators for each head to overcome precision and bandwidth limits of the main arm actuator, but none of them appear to allow a range of motion sufficient to lock multiple heads simultaneously.
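
The appeal is easy to see with rough numbers (the per-head rate and head count below are assumptions for a typical high-capacity drive):

    # Idealized scaling if every head could stream simultaneously,
    # assuming ~250 MB/s per head and 18 recording surfaces (9 platters).
    per_head_rate = 250e6        # bytes/second, assumed outer-track rate
    heads = 18                   # assumed surface/head count
    print(f"single head: {per_head_rate/1e6:.0f} MB/s, "
          f"all heads: {per_head_rate*heads/1e9:.1f} GB/s")   # 250 MB/s -> 4.5 GB/s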


For a long time, mainframes had hard disks with a set of heads on opposite sides of the platters. I'm not sure today's use cases for spinning disks justify that kind of investment. They aren't (or shouldn't be) used for random-write-heavy workloads such as frequently updated databases, but more for archival, and they often sit behind a flash disk acting as a cache. I do that for my home server: a lot of write traffic never hits the disk, because it's overwritten before being evicted from the flash.
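
A toy illustration of that last effect (the workload and hot-set size are made up): a write-back cache only flushes the final version of a hot block, so repeated overwrites never reach the disk.

    # Toy write-back cache: repeated overwrites of hot blocks are absorbed
    # in flash and only the last version ever needs to be flushed to disk.
    import random

    random.seed(0)
    cache = {}                           # block -> latest data (the "flash" tier)
    total_writes = 100_000
    for _ in range(total_writes):
        block = random.randint(0, 999)   # small hot set of 1000 blocks (assumed)
        cache[block] = "data"            # overwrite lands in cache, not on the HDD
    print(f"{total_writes} writes issued, only {len(cache)} blocks ever flushed to disk")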


Data verification to predict drive failure seems like a decent use case, though. The drive is always spinning, so you'd essentially gain an extra data path for read-verify-fix IOPS. Whether you can build this in cost-effectively is a big question. But with rebuild times climbing to upwards of a week, being unable to do continuous health monitoring starts to get really problematic.


This would be cool, but it can be done with a single set of heads if the utilization is less than 100%.

Multiple sets of heads would be useful if the limiting factor is positioning.

BTW, I don't know how a multi-platter drive records its disk blocks. Is a block contained on a single platter, or is it spread across all platters, reading/writing from all heads at the same time?

Dual arms would be handy if the drive had a RAID-like checksumming scheme between platters. If a platter is corrupted but is still readable/writable (it wasn't a head malfunction), the drive could rebuild itself without the help of a computer.

Even if it is a head malfunction, the data could be redistributed between the other platters, reducing the drive capacity.


The specifics of how data is physically arranged across multiple platters are undocumented, complicated, and vary between models. But with some clever benchmarks, much of that information can be inferred: https://blog.stuffedcow.net/2019/09/hard-disk-geometry-micro...

A hard drive will only use one head on one platter at a time. A single logical block will be contained within a single track on one platter. The next logical block will usually be on the same track or an adjacent track on the same platter. Seeking from one track to the next using the same head is generally a bit quicker than switching to a different head on a different platter and getting it lined up with a nearby track.
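
A crude version of the timing approach from that article (the device path is a placeholder; run it against a raw, unmounted disk, and drop the page cache between runs or results will be skewed):

    # Time a small read at increasing offsets; jumps in latency hint at
    # zone, track, and head-switch boundaries. Assumes a hypothetical raw
    # device /dev/sdX and enough privileges to read it.
    import os, time

    DEV = "/dev/sdX"                     # placeholder device, not a mounted filesystem
    BLOCK = 4096
    offsets = [0, 4096, 1 << 20, 64 << 20, 1 << 30, 32 << 30]

    fd = os.open(DEV, os.O_RDONLY)
    for off in offsets:
        t0 = time.perf_counter()
        os.pread(fd, BLOCK, off)         # seek from wherever the head was to this offset
        dt = (time.perf_counter() - t0) * 1e3
        print(f"offset {off:>12}: {dt:6.2f} ms")
    os.close(fd)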


If you've got slow HDDs, the usual solution is to RAID them together. At that point your limiter starts becoming how fast you can slurp data down the line. RAID-0 would sort of emulate what you're talking about here.


That still trades off density for speed; having independent arms would be a huge benefit.


Would it still be cheaper than a similar SSD? Would it be cheaper than an SSD-augmented HD?


> That still trades off density for speed

It does not. RAID-0 has no duplication; it acts effectively like your independent arms. No density tradeoff, just speed.
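
For illustration, the usual striping math (the chunk size and member count are arbitrary here): a logical offset maps to a member disk and an offset on that disk, so a long sequential stream touches every arm.

    # RAID-0 address mapping: which member disk and on-disk offset a logical
    # byte offset lands on, for an assumed 4-disk array with 256 KiB chunks.
    CHUNK = 256 * 1024
    DISKS = 4

    def raid0_map(logical_offset):
        chunk_index = logical_offset // CHUNK
        disk = chunk_index % DISKS
        disk_offset = (chunk_index // DISKS) * CHUNK + logical_offset % CHUNK
        return disk, disk_offset

    # A 1 MiB sequential read crosses all four disks:
    for off in range(0, 1024 * 1024, CHUNK):
        print(off, raid0_map(off))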


There is still a tradeoff: with RAID-0 you add parallelism one whole 24 TB drive at a time, rather than getting more parallelism out of a single 24 TB drive.


I think a big problem, too, is that SSD prices haven't come down fast enough. Basically, the only reason spinning rust is still a thing is that SSDs are far more expensive.


Apples and oranges, right? You need ten thousand hard drives to match the IOPS of one SSD, and even with those 10,000 disks your service latency will still be three orders of magnitude worse. The other advantage of an SSD is bytes per unit volume, in case rack density matters to you.
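
Rough numbers behind that ratio (both figures are assumptions: ~150 random IOPS for a 7200 rpm HDD, ~1M random read IOPS for a high-end NVMe SSD):

    # How many spindles it takes to match one fast NVMe SSD on random reads,
    # using assumed round numbers.
    hdd_iops = 150          # ~7200 rpm drive, small random reads (assumed)
    ssd_iops = 1_000_000    # high-end datacenter NVMe (assumed)
    print(f"~{ssd_iops // hdd_iops} HDDs to match one SSD on IOPS")   # ~6666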


Putting 24TB parts in a Backblaze Storage Pod 6.0 would presumably allow 1440TB in a 4U rack mount server.[1] In practice, how would you reach the same density with SSDs? (I haven't looked into it, just curious if you know that SSD options more dense than that exist, and whether they are equally openly documented.)

[1] https://www.backblaze.com/blog/open-source-data-storage-serv...


There aren't many fully open-source solutions available, but you can buy 100TB 3.5" SSDs for data center use.[1] At Storage Pod densities, that's 6000TB in 4U. I'm not sure if some other factor comes in that limits density, but that's a first-order estimate.

[1] https://nimbusdata.com/products/exadrive/pricing/
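
The density comparison, back-of-envelope (the 60-bay count comes from the Storage Pod design referenced above):

    # 4U density with 60 x 3.5" bays (Storage Pod layout), comparing
    # 24 TB HDDs against 100 TB SSDs in the same bays.
    bays = 60
    print(f"HDD: {bays * 24} TB per 4U")    # 1440 TB
    print(f"SSD: {bays * 100} TB per 4U")   # 6000 TB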


Heat dissipation may be a concern, but if we're talking 6PB of flash storage, we can pretty much consider custom liquid cooling an option.


There are 1U servers with 32 EDSFF slots, which were advertised to reach 1PB with 32TB SSDs. But 16TB is more common, and that's still 2PB in 4U that you can buy today without getting into exotic pricing.


If 2PB of SSD is not exotic pricing... I'm glad I can work with off-the-shelf stuff.


You can put 36 15TB NGSFF SSDs into a 1U-height enclosure that's only 12 cm deep. SSD volumetric density is a lot higher than disk, and has been for a few years now.


SSDs are smaller, so you can pack more of them into the same volume. I'm not aware of really open designs (maybe the Open Compute Project has some), but just from a quick look at Supermicro's homepage:

Something like https://www.supermicro.com/en/products/system/1U/1029/SSG-10... can fit 32 SSDs at 16 TB each in a 1U system (Intel at some point announced 32 TB models in the same form factor, but I'm unsure whether they were ever available).

This style can fit 48x 16TB in 2U: https://www.supermicro.com/en/products/system/2U/2028/SSG-20...

The trouble with such dense SSD capacity is more about being able to interface it fast enough with the host and the outside world.


Well, there are 100TB 3.5" SSDs available...


Let me guess - if you have to ask for the price, you can't afford it ...



Gotta love that $29.00 shipping charge. It's not like they could have afforded to offer free shipping at those prices :)


You need both SSD and HDD, the latter for archival storage and data replication. And that only works at data center scale.


IOPS differences are exaggerated. You can generally beat consumer-grade SSDs with just four HDDs.


That depends very much on the workload. Four hard drives can deliver in aggregate a sequential bandwidth that exceeds any one consumer SATA SSD, or the sustained write bandwidth of most consumer NVMe SSDs. But when people are discussing IOPS, the usual implication is that they're talking about non-sequential access of relatively small block sizes. For those workloads, the difference between consumer SSDs and hard drives is still measured in orders of magnitude.
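
To put numbers on that (all figures are assumed round values for consumer hardware):

    # Four HDDs vs one consumer SSD: aggregate sequential bandwidth vs
    # small-block random IOPS, using assumed round figures.
    hdds = 4
    hdd_seq, hdd_iops = 250, 150          # MB/s and 4K random IOPS per drive (assumed)
    ssd_seq, ssd_iops = 550, 100_000      # consumer SATA SSD (assumed)

    print(f"sequential: {hdds * hdd_seq} MB/s (4x HDD) vs {ssd_seq} MB/s (SSD)")
    print(f"random 4K:  {hdds * hdd_iops} IOPS (4x HDD) vs {ssd_iops} IOPS (SSD)")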

So, what workloads did you have in mind when you said "generally"?



