It should be noted the Netmaker benchmarks were run in a shared cloud environment, so the throughput-related metrics must be thrown out. You have no visibility into what other workloads are running on the same physical host or how your test hosts are physically connected. As a result you could be measuring software X with two instances on the same bare metal and software Y across two instances in different buildings (yes, instances can move between buildings without you knowing).
A proper test would be done with bare metal machines with 10/40/100G NICs and a controlled network plane free of other traffic.
Not to defend the Netmaker article too much (it's pretty crappy), but the noise of shared cloud infra is not unmanageable; it just needs to be factored into the plan and analysis. Just repeating the tests a couple of times, preferably in different regions, should give pretty meaningful results. Also, recognizing the scale of things, noisy neighbors are extremely unlikely to cause an order-of-magnitude difference, especially when operating at < 1 Gbps. Lastly, just using good instance types helps a lot here too; the T types Netmaker used were pretty much the worst possible option. The table in the docs provides guidance:
Note how some types have "up to X Gigabit", some are just "X Gigabit", and for the t2.micro that Netmaker used the docs say just "low to moderate" :D
But all that being said, I think it is valuable to benchmark things in noisy real-world conditions too and not only in pristine lab conditions. It definitely requires a bit more effort on the analysis side to understand what's going on. And benchmarking without good analysis is pretty bad regardless of whether it was done in a public cloud or some dedicated lab.
One more thing: from the docs you can also infer how much the host's networking is overcommitted by looking at the documented performance of .metal instances. For example, m6in.8xlarge has 50 Gbps networking and m6in.metal has 200 Gbps. But m6in.metal is equivalent to m6in.32xlarge in terms of capacity, and 32xlarge is 4 times bigger than 8xlarge. From that you can easily tell that m6in.8xlarge has no network overcommit, because the network and CPU/mem allocations match exactly. At the same time you can tell that m6in.large does have network overcommit: m6in.large is 1/64th of the .metal instance in terms of size, but 25 Gbps × 64 = 1600 Gbps, 8 times more than the 200 Gbps of the .metal instance!
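The same back-of-the-envelope math as a rough sketch in code, using only the documented figures and sizing ratios quoted above (treat it as illustrative, not as authoritative AWS data):

    # Overcommit = bandwidth promised to all slices of a host / the host NIC.
    # Figures are the documented m6in numbers quoted above.
    METAL_GBPS = 200      # m6in.metal networking
    METAL_SIZE = 32       # m6in.metal ~= m6in.32xlarge

    def overcommit_ratio(instance_gbps, instance_size):
        # How many times the host NIC would be needed if every slice
        # of the host burst to its documented bandwidth at once.
        slices_per_host = METAL_SIZE / instance_size
        return (instance_gbps * slices_per_host) / METAL_GBPS

    print(overcommit_ratio(50, 8))     # m6in.8xlarge -> 1.0 (no overcommit)
    print(overcommit_ratio(25, 0.5))   # m6in.large ("up to 25 Gbps") -> 8.0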
Finally, EC2 has controls for instance "tenancy" and "placement"; you can ensure that your instances run on hosts dedicated to your account, again avoiding host-level noisy neighbors. Of course both of these only control the physical host; the networking infra outside the host is, afaik, almost completely out of your control. But at the same time I'd be surprised if there is any significant level of congestion within an AZ.
This is all to say that there is a whole spectrum of options between "using random public instances with no consideration" and "setting up a private dedicated hardware lab"; it's not just a binary choice.
I wish you would benchmark more common point-to-point VPNs as well. I see the comment explaining why you don't, but I think it would be useful. I'm comparing Tailscale, Nebula, etc against Wireguard with out-of-band static route management. These are very different beasts, but if I'm going to pay a 50% performance penalty for mesh routing, maybe I will architect my system to not require it. If it is more like a 5% penalty, then I will absolutely not.
Isn't that effectively what the Netmaker results showed? Netmaker was able to effectively saturate the 10-gig hardware; Tailscale and Nebula were close. Tailscale was the most CPU-efficient (but least memory-efficient).
Nebula is great!
From what I found after testing other solutions (Headscale, NetBird, Netmaker), it's also the only completely open-source mesh VPN that can be configured with a highly available control plane (just run multiple lighthouses, nothing is shared) and that supports multiple root CAs for nodes, relays and control planes (and each node can be a relay too).
I just wish there was a Kubernetes operator to easily set up mesh sidecars like with Tailscale, and it would be perfect!
FYI, the Nebula mobile client is source-available but not open-source. The devs from Defined Networking have been cagey about this and don't make this fact obvious, which makes me wary of Nebula.
Fair enough about the Android mobile client... My use case only involves meshing Linux appliances across various networks, so we only need the Nebula core binaries, which are under the MIT license.
They could have just added a LICENSE file stating that you are not allowed to use the software without a commercial license. Instead they chose to be vague about it. Doesn't really inspire confidence.
Nebula seemed like a very interesting choice when we were looking for a mesh VPN, but the lack of IPv6 support led to it being removed from consideration very quickly.
Weird that they don't do IPv6. Tinc is much older and less polished, but it can do IPv6 just fine.
I personally use only IPv4 on my overlay, but I understand why people want it.
I never really looked at tailscale and zerotier because they're too commercial. Nebula I did try once but I didn't really like it. A bit too complex for what I need.
For three years, Elestio has trusted Nebula for our IT needs. We chose it for its solid stability, essential for our daily work. Nebula has been reliable without any major issues. Currently, we're deploying thousands of VMs, all seamlessly connecting through Nebula. Thanks to the Slack team for creating and maintaining Nebula.
Not necessarily. "Fastest" could mean many different things, and one might be faster in some situations but slower in others, or the difference in speed could be less than the uncertainty in the measurement.
The issue with iOS is that the system will kill the VPN process if it uses more than a tiny amount of RAM (including buffers used to send data over the connection). You also have to be sure to do keep-alive heartbeats, or it will be killed for being idle. VPNs on Android work reliably; I haven't noticed any issues.
Wireguard's lower-level than the solutions that were benchmarked - in fact, Netmaker and Tailscale themselves use Wireguard as their backing technology.
But even with them both using Wireguard, there are choices involved that affect performance, for instance whether to use the Wireguard kernel module or a userspace implementation, how to configure routing, packet filtering, firewalls, etc.
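As a rough (Linux-only) sketch, you could heuristically check which backing a given deployment ended up with; the process names below (wireguard-go, boringtun) are just two common userspace implementations, not an exhaustive list:

    # Heuristic check: is WireGuard running via the kernel module or a
    # userspace implementation? (Linux only; rough sketch, not exhaustive.)
    import subprocess

    def wireguard_backend():
        lsmod = subprocess.run(["lsmod"], capture_output=True, text=True).stdout
        kernel = any(line.startswith("wireguard") for line in lsmod.splitlines())
        procs = subprocess.run(["ps", "-eo", "comm"], capture_output=True, text=True).stdout
        userspace = any(name in procs for name in ("wireguard-go", "boringtun"))
        if userspace:
            return "userspace"
        return "kernel" if kernel else "unknown"

    print(wireguard_backend())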
To measure the intrinsic overhead of the mesh VPN, but your point is well-taken; robustness to latency is also an important consideration, probably more important when they are all within spitting distance.
Yes, that's why I was curious about a comparison to Wireguard, because for some, the advantage that the mesh/managed VPN setup brings may not be worth the performance trade-off.
I think you misunderstand how mesh VPNs work. Their primary purpose is as a control plane: introducing peers to each other so they can communicate either directly or via a relay (e.g. DERP), with per-node encryption. They should have no overhead compared to a single point-to-point encrypted tunnel like Wireguard, because the “mesh” features are not in the data path.
The only real difference here is how the VPN product implements Wireguard, userspace or kernel space, and how well-tuned that implementation is. It might make sense to compare Wireguard implementations, but (afaik) all are using one of several open-source ones. Tailscale did some work to improve performance that they blogged about here: https://tailscale.com/blog/more-throughput
Neither Nebula nor ZeroTier is based on Wireguard.
What they compare in the article are systems that provide some form of ACL, which is why bare Wireguard is not included. That means there are features in the data path that could have significant performance implications versus a simple tunnel. The impact of using ACL features isn't really a focus of the presented benchmarks, but they do mention a separate test of using iptables to bolt on access controls.
This looks like a full VPN service to me.
Since Wireguard is only the tunnel part of a VPN service, i.e. it's missing session management and key distribution, this would not be a proper comparison IMHO.
Even without reading the article, you can imagine things like "Runner A beats Runner B in a sprint. Runner B beats Runner A in a marathon. Which one is faster?"
The title suggests that different VPNs are fastest in different measures/tests, so no single VPN is clearly "fastest" in aggregate.
Openness about benchmarks is great, but for me (and I think many others) these mesh VPN benchmarks are useless.
If I have all hosts on a 10G physical segment, why would I use a mesh VPN between them?
IMHO, the interesting case for a mesh VPN is a very heterogeneous setup: for example, 2 hosts in different DCs, 2 hosts on asymmetrical, oversubscribed domestic links (ADSL/DOCSIS), 2 mobiles at different ends of the world (LTE at best) and 2 laptops on cafe WiFi (again, in different countries).
https://www.defined.net/blog/nebula-is-not-the-fastest-mesh-...
This is in response to this oft-cited benchmark post by Netmaker:
https://medium.com/netmaker/battle-of-the-vpns-which-one-is-...
Previous discussions on encrypted overlay mesh networks here:
"Would we still create Nebula today?" (Oct 2023) https://news.ycombinator.com/item?id=37871534