CF had their route covered by RPKI, which at a high level uses certs to formalize delegation of IP address space.
What caused this specific behavior is the dilemma of backwards compatibility when it comes to BGP security. We are a long way off from all routes being covered by RPKI (just 56% of v4 routes according to https://rpki-monitor.antd.nist.gov/ROV ), so invalid routes tend to be treated as less preferred, not rejected, by BGP speakers that support RPKI.
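To make the "depref, don't drop" idea concrete, here is a toy sketch of what that policy choice looks like. This is not any router's actual policy language; the states and local-pref numbers are made up for illustration.

    /* Toy sketch: deprioritize RPKI-invalid routes instead of rejecting them.
     * Enum values and local-pref numbers are illustrative only. */
    #include <stdio.h>

    enum rov_state { ROV_VALID, ROV_NOT_FOUND, ROV_INVALID };

    /* A strict policy would drop ROV_INVALID outright; a backwards-
     * compatible one just makes it the least preferred option. */
    static int local_pref_for(enum rov_state s)
    {
        switch (s) {
        case ROV_VALID:     return 200; /* prefer signed-and-valid routes */
        case ROV_NOT_FOUND: return 100; /* most of the internet today */
        case ROV_INVALID:   return 10;  /* still usable if nothing better exists */
        }
        return 100;
    }

    int main(void)
    {
        printf("valid=%d not-found=%d invalid=%d\n",
               local_pref_for(ROV_VALID),
               local_pref_for(ROV_NOT_FOUND),
               local_pref_for(ROV_INVALID));
        return 0;
    }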
PMTU just doesn't feel reliable to me because of poorly behaved boxes in the middle. The worst offender I've had to deal with was AWS Transit Gateway, which just doesn't bother sending ICMP too big messages. The second worst offender, IMO, is (data center and ISP) routers that generate ICMP replies in their CPU, meaning large packets hit a rate-limited exception punt path out of the switch ASIC over to the cheapest CPU they could find to put in the box. If too many people are hitting that path at the same time, maybe no reply for you.
A rarer, but really frustrating to debug, case was when we had an L2 switch in the path with a lower MTU than the routers it was joining together. Without an IP-level stack there is no generation of ICMP messages, and that thing just ate larger packets. The even stranger case was a Linux box doing forwarding that had segment offload left on. It was taking in several 1500 byte TCP packets from one side, smashing them into ~9000 byte monsters, and then trying to send those over a VPN-ish network interface that absolutely couldn't handle that. Even if the network in the middle had bothered to generate the ICMP too big message, the source would have been thoroughly confused because it never sent anything over 1500.
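When I'm chasing this kind of thing, one cheap check (a Linux-specific sketch; IP_MTU is only meaningful on a connected socket, and the port here is arbitrary since nothing is actually sent) is to ask the kernel what path MTU it has cached toward a destination, which makes it obvious when the too big messages are being eaten:

    /* Sketch: ask the Linux kernel what path MTU it has cached toward a host.
     * IP_MTU / IP_MTU_DISCOVER are Linux-specific socket options. */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <netdb.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <host>\n", argv[0]);
            return 1;
        }

        struct addrinfo hints = { .ai_family = AF_INET, .ai_socktype = SOCK_DGRAM };
        struct addrinfo *res;
        int rc = getaddrinfo(argv[1], "443", &hints, &res);
        if (rc != 0) {
            fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
            return 1;
        }

        /* "Connecting" a UDP socket pins the destination, so the kernel
         * associates its cached route (and PMTU) with this socket. */
        int fd = socket(res->ai_family, res->ai_socktype, 0);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
            perror("socket/connect");
            return 1;
        }

        /* Ask for strict PMTU discovery: DF set, never fragment locally. */
        int pmtudisc = IP_PMTUDISC_DO;
        setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &pmtudisc, sizeof(pmtudisc));

        int mtu = 0;
        socklen_t len = sizeof(mtu);
        if (getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) == 0)
            printf("kernel's cached path MTU to %s: %d\n", argv[1], mtu);
        else
            perror("getsockopt(IP_MTU)");

        freeaddrinfo(res);
        close(fd);
        return 0;
    }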
> The even stranger case was a Linux box doing forwarding that had segment offload left on. It was taking in several 1500 byte TCP packets from one side, smashing them into ~9000 byte monsters, and then trying to send those over a VPN-ish network interface that absolutely couldn't handle that. Even if the network in the middle had bothered to generate the ICMP too big message, the source would have been thoroughly confused because it never sent anything over 1500.
This is an old Linux TCP offloading bug: large receive offload smooshes the inbound packets together, and the resulting packet is too big to forward.
I had to track down the other side of this. FreeBSD used to resend the whole send queue if it got a too big message, even if the size did not change. Sending it all at once made it pretty likely for the broken forwarder to get packets close enough together to do LRO, which produced packets large enough to show up as network problems.
I don't remember where the forwarder seemed to be, somewhere far away, IIRC.
> PMTU just doesn't feel reliable to me because of poorly behaved boxes in the middle. The worst offender I've had to deal with was AWS Transit Gateway, which just doesn't bother sending ICMP too big messages.
This made me think of a series of "war room" meetings I had been part of early in my career. Strangely enough, also a defect revealed when the platform was low on memory. This was also the issue where I learned the value of documenting experiments and results once an investigation has taken a non-trivial amount of time. Not just to show management what you are doing, but to keep track of all the things you have already tried rather than spinning in circles.
The war room meetings were full of managers and QA engineers reporting on how many times they reproduced the bug. Their repro was related to triggering a super slow memory leak in the main user UI. I had the utmost respect for the senior QA engineer who actually listened to us when we said we could repro the issue way faster, and didn't need the twice daily reports on manual repro attempts. He took the meetings from his desk, 20 feet away, visible through the glass wall of the room we were all crammed into. I unfortunately didn't have the seniority to do the same.
Since I can't resist telling a good bug story:
The symptom we were seeing was that when the system was low on memory, a process (usually the main user UI, but not always) would get either a SIGILL at a memory location containing a valid CPU instruction, or a floating point divide-by-zero exception at a code location that didn't contain a floating point instruction. I built a memory pressure tool that would frequently read how much memory was free and would mmap (and dirty) or munmap pages as necessary to hold the system just short of the OOM kill threshold. I could repro what the slow memory leak was doing to the system in seconds, rather than waiting an hour for the memory leak to do it.
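The tool was roughly along these lines (a minimal sketch, not the original; the free-memory target, chunk size, and poll interval are made-up numbers for illustration):

    /* Sketch of a memory pressure tool: keep MemFree near a target by
     * mapping and dirtying anonymous pages, releasing them if we overshoot. */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define TARGET_FREE_KB (64 * 1024)        /* keep roughly 64 MB free (illustrative) */
    #define CHUNK_BYTES    (16 * 1024 * 1024) /* allocate in 16 MB chunks */
    #define MAX_CHUNKS     4096

    static long mem_free_kb(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];
        long kb = -1;
        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "MemFree: %ld kB", &kb) == 1)
                break;
        }
        fclose(f);
        return kb;
    }

    int main(void)
    {
        void *chunks[MAX_CHUNKS];
        int nchunks = 0;

        for (;;) {
            long free_kb = mem_free_kb();
            if (free_kb < 0)
                break;

            if (free_kb > TARGET_FREE_KB && nchunks < MAX_CHUNKS) {
                void *p = mmap(NULL, CHUNK_BYTES, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p != MAP_FAILED) {
                    memset(p, 0xA5, CHUNK_BYTES); /* dirty every page so it really counts */
                    chunks[nchunks++] = p;
                }
            } else if (free_kb < TARGET_FREE_KB / 2 && nchunks > 0) {
                munmap(chunks[--nchunks], CHUNK_BYTES); /* back off a bit */
            }

            usleep(100 * 1000); /* poll ~10x a second */
        }
        return 0;
    }

Run alongside the workload, something like this gets the system into the interesting state in seconds instead of waiting for a slow leak to do it.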
I wanted to learn more about what was going on between code being loaded into memory and then being executed, which led me to look into the page fault path. I added some tracing that would dump out info about recent page faults after a SIGILL was sent out. It turned out all of the code that was hitting these mysterious errors had always been _very_ recently loaded into memory. I realized that when Linux is low on memory, one of the ways it can get some memory back is to throw out unmodified memory-mapped file pages, like the executable pages of libraries and binaries. In the extreme case, the system makes almost no forward progress and spends almost all of its time loading code, briefly executing it, and then throwing it out for another process's code.
I realized there was a useful-looking code path in the page fault logic we never seemed to hit. This code path would check if the page was marked as having been modified (and, if I recall correctly, also if it was mapped as executable). If it passed the check, this code would instruct the CPU to flush the data cache in the address range back to the shared L2 cache, and then invalidate the instruction cache for the range. (The ARM processor we were using didn't have any synchronization between the L1 instruction and L1 data caches, so writing out executable content requires extra synchronization, both for the kernel loading code off disk and for JIT compilers.) With a little more digging around, I found the kernel's implementation of scatter-gather copy would set that bit. However, our SoC vendor, in their infinite wisdom, made a copy of that function that was exactly the same, except that it didn't set the bit in the page table. Of course they used it in their SDIO driver.
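The same requirement shows up in user space: anything that writes instructions into memory on a core like that has to clean the data cache and invalidate the instruction cache before jumping to the new code. A minimal JIT-style sketch of the dance, using GCC/Clang's __builtin___clear_cache rather than the kernel's internal helpers (the instruction encodings are AArch64-specific and the RWX mapping is for brevity; on other architectures it just prints a message):

    /* Sketch: user-space version of the D-cache/I-cache maintenance dance.
     * Emits two AArch64 instructions ("mov w0, #42; ret") into a page,
     * syncs the caches, then calls them. */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
    #if defined(__aarch64__)
        uint32_t code[] = {
            0x52800540, /* mov w0, #42 */
            0xd65f03c0, /* ret         */
        };

        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memcpy(buf, code, sizeof(code));

        /* Clean the D-cache and invalidate the I-cache for the range.
         * Without this, the core may fetch stale bytes and fault or execute
         * garbage -- exactly the class of bug described above. */
        __builtin___clear_cache((char *)buf, (char *)buf + sizeof(code));

        int (*fn)(void) = (int (*)(void))buf;
        printf("JITed function returned %d\n", fn());
    #else
        printf("AArch64-only demo; the cache-maintenance call is the point.\n");
    #endif
        return 0;
    }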
bitbake -e <recipe> is super useful for that game. It dumps out a complete history of where every variable was set or changed, and its value along the way. I also use it to do what I call "variable shopping," where I roughly know what path/name content I need, but not the name of the variable that holds it.
I spent a lot of time debugging an internal fork of an open source BGP implementation (really old Quagga). The confederation code always struck me as being nothing but weird exceptions to how BGP normally worked. I was happy to never hear a network engineer suggest confederations with a straight face.
With all that investment in addresses, I'm surprised AWS is still the only cloud provider to charge for them (as far as I know). It will be interesting to see whether other cloud providers follow, and whether they compete on price or just match AWS. It kind of feels like AWS charging for v4 addresses will "give permission" to other providers to charge.
I'm also curious whether the price will come down over time as addresses are yielded back. I guess it depends on whether their goal is to recoup all the money they spent on addresses, or just to avoid running out.
I think the primary problem with domain fronting that ECH solves is that ECH doesn't involve using a third party's domain name and potentially dragging a single third party into the censorship muck. My read of the support email Signal shared is that AWS was unhappy that a domain they owned would likely become entangled. While ECH still increases the risk for everyone else sharing the same load balancer as a censorship target, it is at least a fully distributed risk, rather than requiring the client to pick a specific domain or set of domains to pull into their fight.