ZFS over iSCSI Storage in Proxmox (blog.haschek.at)
41 points by transpute on Jan 30, 2024 | 20 comments



Turn on Jumbograms. I ran iSCSI for ZFS served out via NFS on distinct cards so there was less contention between the disk fetch and the NFS serving, and it worked "ok", but that was FreeBSD (partly) and a 9000 MTU definitely made a difference. Possibly right-sizing the MTU to be bigger than the blocksize is its own tuning question, but 9k jumbo definitely improved things.

Why send 4 packets when one will do? Same volume of data, less switch burden to latch it through.
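A quick back-of-the-envelope sketch of the header math (the 128 KiB burst size and plain IPv4/TCP headers are my assumptions, illustrative rather than measured):

    import math

    # Illustrative only: frames and header bytes needed to move one burst
    # at standard vs. jumbo MTU.
    payload = 128 * 1024            # bytes in one read burst (assumed)
    headers = 14 + 20 + 20          # Ethernet + IPv4 + TCP, no options

    for mtu in (1500, 9000):
        per_frame = mtu - 40        # IPv4 + TCP headers count against the MTU
        frames = math.ceil(payload / per_frame)
        print(f"MTU {mtu}: {frames} frames, {frames * headers} header bytes")

Roughly 90 frames at 1500 versus 15 at 9000 for the same data.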


After years in enterprise storage with endless performance testing: there's almost no point. Modern CPUs and NICs barely benefit. In 2005 when we had dual-core CPUs that were constantly buried and NICs with 0 offload, it made a ton of sense - heck we had dedicated iSCSI HBAs (QLA4010 represent!).

That being said, if you've got three servers and two VLANs with no worries about the jumbos ever escaping: I guess? But if you see even a 5% performance increase, I'll be shocked. On the flip side you're one misconfiguration away from endless troubleshooting if those jumbos escape.

Also *jumbo frames.


That's great feedback. If it doesn't help, don't do it.


It'd be great to have an MTU of say 64 KiB or greater.

Although I guess you'd also need a longer-than-32-bit CRC to detect all possible 3-bit errors past an 11 kB frame size. A 40-bit CRC would be sufficient, at least up to a 188 kB frame size or so.


See perhaps "Best CRC Polynomials":

* https://users.ece.cmu.edu/~koopman/crc/


Thanks. Pretty nice!


If we were redoing Ethernet, I wouldn't mind removing the CRC completely. If you want end-to-end reliability, you should do it in the layer above Ethernet. If you want per-link packet validation, we've already been layering advanced FEC algorithms at the physical layer for high-speed Ethernet. The advantage of the latter is that it's both optional and replaceable without requiring even more dynamic bits or redundant functionality in the layer-2 packet. Then for MTU, make it a 32-bit field instead of a 16-bit field in case anyone wants to make hardware that supports more than 64k in the future.


Some people say that for TCP, smaller packets give better acknowledgement pacing. iSCSI is mostly over local, single-switch links for me, but for general-purpose streamed TCP data it may well be that "smaller is better" for rate estimates and window management.


Nothing prevents you from using a smaller MTU... well "TU" for TCP frames.

Fewer packets to process would speed things up. Fewer headers to process.

Also if your TCP flow bandwidth is counted in tens or hundreds of gigabits per second, there's still going to be plenty of ACKs.
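Rough numbers for that, assuming delayed ACKs (one ACK per two full segments) and jumbo-sized segments; the figures are illustrative:

    # Approximate ACK rate at a given line rate; all values are assumptions.
    line_rate = 100e9               # bits per second
    frame_bits = 9000 * 8           # one jumbo-sized segment on the wire

    segments_per_sec = line_rate / frame_bits
    acks_per_sec = segments_per_sec / 2        # delayed ACKs: one per two segments
    print(f"{segments_per_sec:,.0f} segments/s, ~{acks_per_sec:,.0f} ACKs/s")

That still works out to around 700k ACKs per second at 100 Gbit/s.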


MSS (maximum segment size) is the term at the TCP layer. Each end of a connection can (and usually does) declare its MSS in a TCP option in the first packet it sends.

Advertised MSS, interface MTU, and route MTU can all constrain packet sizing.

Using large-MTU routes for internal destinations can work well.
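If you want to see what actually got negotiated, you can read it off a connected socket; a minimal Python sketch (Linux behaviour assumed, and the destination is just a placeholder):

    import socket

    # Read the kernel's current maximum segment size for this connection.
    with socket.create_connection(("example.com", 80)) as s:
        mss = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
        print("effective MSS:", mss)

The value reflects whatever the advertised MSS, interface MTU, and route MTU worked out to.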


Right, vaguely aware of that, but been a while. Thanks for correcting!


What amount of packet loss were you experiencing in your setup?


Low enough that we had viable mounts before, but the retransmit counts were big. I don't have the host any more; I moved to an iXsystems TrueNAS. Probably I should have looked harder on the provider side.


One can then use a local disk as a ZIL to improve IOPS (rough command sketch at the end of this comment).

When "Hybrid Storage Pools" storage pools were first introduced in 2008, when flash was still really expensive, this was a clever way of balancing speed and bulk storage with budget constraints:

* https://ahl.dtrace.org/2008/11/10/hybrid-storage-pools-in-th...

Nowadays flash is cheap/er, so all-flash storage is much more popular, with many storage products able to do tiered storage where (c)older data bits are shuffled from fast-expensive flash to slow-cheaper spinning rust.
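For the ZIL part, attaching a fast local device to an existing pool is a one-liner; here's a hedged sketch (the pool name "tank" and the NVMe path are made up, adjust to your setup):

    import subprocess

    # Add a fast local device as a separate log (ZIL/SLOG) vdev to a pool.
    # Pool name and device path are placeholders.
    subprocess.run(["zpool", "add", "tank", "log", "/dev/nvme0n1"], check=True)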


I hope the author really does mean "home lab" and not "home production". Having run my own personal disk array for over two decades, this is like the opposite of what I've come to want. The simpler and more straightforward things can be, the better. Otherwise when things fail (and they will fail, despite that redundancy (or even perhaps because of it)), you'll end up with circular dependencies that make diagnosing and fixing things quite painful.


I have a similar setup but for networking I find some really cheap InfiniBand cards (20-40) on eBay and configure iSER in tatgetcli


> iSER in tatgetcli

Google only lists your comment as a result. What is that?



targetcli might be the correct word, and iSER seems to be RDMA for iSCSI.


This is great; setting the same thing up with TrueNAS wasn't as easy as I had hoped.



