TCP Puzzlers (2016) (tritondatacenter.com)
86 points by Tomte 8 months ago | 16 comments



"It's possible for one system to think it has a working, established TCP connection to a remote system, while the remote system knows nothing about that connection."

I have this problem right now, between applications in an on-premises DC and a public cloud, talking over a VPN. If there is a blip on the network, or a firewall failover happens, the other side continues to think it's connected. So, to keep my application side resilient, we configure the listeners to go back to listening mode if nothing is received for 5 minutes. That way we don't have to intervene manually and the connection comes back up on its own.
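
Roughly this shape, to sketch it in Go (hypothetical port and handler; our real setup differs, and the 5-minute cutoff is just what we settled on):

    package main

    import (
        "log"
        "net"
        "time"
    )

    // handle stands in for whatever the application does with a message.
    func handle(msg []byte) { log.Printf("got %d bytes", len(msg)) }

    func main() {
        ln, err := net.Listen("tcp", ":9000")
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Print(err)
                continue
            }
            go func(c net.Conn) {
                defer c.Close()
                buf := make([]byte, 4096)
                for {
                    // If nothing arrives for 5 minutes, assume the peer is
                    // gone and drop the connection; the accept loop is still
                    // listening, so the other side can reconnect on its own.
                    c.SetReadDeadline(time.Now().Add(5 * time.Minute))
                    n, err := c.Read(buf)
                    if err != nil {
                        return
                    }
                    handle(buf[:n])
                }
            }(conn)
        }
    }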


We wouldn't have had Heartbleed if the SSL/TLS designers hadn't thought it necessary to include a heartbeat protocol - precisely because they couldn't trust TCP on its own to do this. (Why one would ever want to send a 4KB heartbeat packet beats me, though.)


Yeah, I've seen a lot of code doing "the right thing" and trusting TCP to monitor connections. I trust none of it. My application layer always pings some status back and forth at regular intervals and both ends will disconnect after an explicitly set timeout. I'm sure there are situations (probably most static web server stuff) where this is a terrible idea but for realtime(ish) remote control it's the only way to go.
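
Concretely, the client side of that pattern looks something like this (a minimal Go sketch with a made-up ping message, peer address, and timeouts; the server does the mirror image):

    package main

    import (
        "log"
        "net"
        "time"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:9000") // hypothetical peer
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // Writer: send a tiny application-level ping at a fixed interval.
        go func() {
            t := time.NewTicker(10 * time.Second)
            defer t.Stop()
            for range t.C {
                if _, err := conn.Write([]byte("PING\n")); err != nil {
                    return
                }
            }
        }()

        // Reader: any traffic (ping replies or real data) resets the clock;
        // silence past the explicit timeout means we hang up ourselves.
        buf := make([]byte, 4096)
        for {
            conn.SetReadDeadline(time.Now().Add(30 * time.Second))
            if _, err := conn.Read(buf); err != nil {
                log.Print("peer silent too long, disconnecting: ", err)
                return
            }
        }
    }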


In practice, I’ve noticed keep-alives are unreliable, especially for the shorter timeouts you need on, say, websockets, where the intervals have to be a bit tighter. Every OS does something different, and the timeouts you usually get are quite long (and unknown to the application layer, iirc - an opaque mix of OS-level knobs/multipliers and per-socket options).

Afaik the current best practice is to reimplement application-level keepalives everywhere. But that seems to me to be both protocol pollution and prone to subtle leak-like bugs from dead conns. It would be nice if keep-alives could be used instead.

Also: NATs and firewalls put timeouts in their “routing tables”(?) so that conns are dead after X seconds anyway. Does anyone here know from experience what X is typically, in practice?


Linux has socket options which let socket owners query and set TCP Keepalive parameters. See `TCP_KEEPCNT` and friends: https://www.man7.org/linux/man-pages/man7/tcp.7.html
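
For example, with a connected socket you can set all of them per-connection; a quick Go sketch of what that looks like (Linux-only constants, numbers picked arbitrarily, error handling inside Control elided):

    package main

    import (
        "log"
        "net"

        "golang.org/x/sys/unix"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:80") // hypothetical peer
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        raw, err := conn.(*net.TCPConn).SyscallConn()
        if err != nil {
            log.Fatal(err)
        }
        // Maps directly onto the tcp(7) options: start probing after 60s
        // idle, probe every 10s, give up after 3 unanswered probes.
        raw.Control(func(fd uintptr) {
            unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_KEEPALIVE, 1)
            unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_KEEPIDLE, 60)
            unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_KEEPINTVL, 10)
            unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_KEEPCNT, 3)
        })
    }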

The usual term for the table is NAT table or connection tracking table.

The usual timeout is as long as a piece of string.

I've seen timeouts as short as a few seconds. These are sometimes configured in environments where applications are expected to pause for longer than that, so the software ends up reporting timeouts on many actions.

I've also seen timeouts hours long. These are sometimes configured in environments where tuples are reused rapidly, so the NAT device drops new connections that land on the same tuple because those connections look "old".

Networking is great.


Thanks for this.

> See `TCP_KEEPCNT` and friends

Yeah, three options, with manual tuning of even the number of probes. And that's for one OS only. This is one reason why application-level pinging is just more practical. I am using Go and gave up on TCP keepalives (even though there is a std API for them) because the resulting behavior was a mess.
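
(For reference, the std surface I mean is essentially just this; how it maps onto the kernel knobs varies by OS and Go version, which is the messy part. Peer address is made up:)

    package main

    import (
        "log"
        "net"
        "time"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:80") // hypothetical peer
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        tcp := conn.(*net.TCPConn)
        // On/off plus a single period knob; the probe count (and, depending
        // on platform/version, the interval) still comes from OS defaults.
        tcp.SetKeepAlive(true)
        tcp.SetKeepAlivePeriod(30 * time.Second)
    }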

I don’t know if it should be the job of stdlibs or the OS, but at the application level I’d prefer something like a single keepalive param with reasonable behavior.


Yeah, what you say makes sense. I am sure you are not the first person to think of this. I wonder if a pre-made library exists for it? I couldn't find anything with a quick search.


From my VoIP experience, there is no 'typical' in NAT. There have been attempts to classify NATs (https://www.rfc-editor.org/rfc/rfc3489, Section 5), but they were abandoned after seeing how NATs sometimes change their behaviour depending on their load, the number of connections, the phase of the moon, etc. (https://www.rfc-editor.org/rfc/rfc4787, Section 3). So, yes, application-level keepalives are your best bet.


At some point since this was written in 2016, the webpage broke the formatting of the code samples, collapsing all of the outputs to one line.

Luckily, the Internet Archive has a version with correct formatting, which I find miles easier to read: https://web.archive.org/web/20220823105029/https://www.trito...


Puzzler 3, to me, implies the author assumed both sides come to know the state of a single duplex connection, rather than each keeping its own state for a simplex connection. TCP has always been designed to allow one side to continue receiving after indicating to the other side that it is finished sending, and it requires both sides to close before the whole connection can end.

https://www.rfc-editor.org/rfc/rfc793#section-3.5
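
In socket terms that's shutdown(SHUT_WR). A small Go illustration of the half-close in action (made-up peer address): send a whole request, FIN our side, then keep reading the response.

    package main

    import (
        "io"
        "log"
        "net"
        "os"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:9000") // hypothetical peer
        if err != nil {
            log.Fatal(err)
        }
        tcp := conn.(*net.TCPConn)

        // Send everything we have, then FIN our side: "finished sending".
        tcp.Write([]byte("whole request\n"))
        tcp.CloseWrite()

        // The connection isn't gone: we keep receiving until the peer
        // closes its own send side, and only then does the whole thing end.
        io.Copy(os.Stdout, tcp)
        tcp.Close()
    }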


I have felt for a long time that this half-closed state was a mistake. It fails to match naive expectations and can be hard to handle correctly even if you know about it. And as far as I can see there is no real benefit to having it. Is there some real use case for it (i.e. not just a minor convenience)?


Related:

TCP Puzzlers (2016) - https://news.ycombinator.com/item?id=20839902 - Aug 2019 (7 comments)

TCP Puzzlers - https://news.ycombinator.com/item?id=12315814 - Aug 2016 (70 comments)


The code blocks seem to have gotten mangled in the process somehow :(


I thought it might be a CSS issue but it seems like the actual newlines have been lost.


The old version [0] has the correct layout.

[0] https://web.archive.org/web/20221010031002/https://www.trito...


truss, that's a command I haven't seen in a long time




