
Long polling has some problems of its own.

Second Life has an HTTPS long polling channel between client and server. It's used for data that's too bulky for the UDP connection, isn't too time-sensitive, or needs encryption. This has caused much grief.

On the client side, the poller uses libcurl. Libcurl has timeouts. If the server has nothing to send for a while, libcurl times out. The client then makes the request again. This results in a race condition if the server wants to send something between the timeout and the next request. Messages get lost.

On top of that, the real server is front-ended by an Apache server. This just passes through relevant requests, blocking the endless flood of junk HTTP requests from scrapers, attacks, and search engines. Apache has a timeout, and may close a connection that's in a long poll and not doing anything.

Additional trouble can come from middle boxes and proxy servers that don't like long polling.

There are a lot of things out there that just don't like holding an HTTP connection open. Years ago, a connection idle for a minute was fine. Today, hold a connection open for ten seconds without sending any data and something is likely to disconnect it.

The end result is an unreliable message channel. It has to have sequence numbers to detect duplicates, and can lose messages. For a long time, nobody had discovered that, and there were intermittent failures that were not understood.
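Concretely, the receiving side ends up doing something like this (a minimal sketch, not the actual viewer code; names are invented):

    # Dedup for an unreliable long-poll channel. Assumes the server
    # stamps each message with a monotonically increasing sequence number.
    last_seq = 0

    def process(payload):
        ...  # application-level handler (stub)

    def handle_poll_response(messages):
        """messages: list of (seq, payload) pairs from one poll response."""
        global last_seq
        for seq, payload in messages:
            if seq <= last_seq:
                continue  # duplicate from a retried request; drop it
            if seq > last_seq + 1:
                print("gap: lost %d message(s)" % (seq - last_seq - 1))
            last_seq = seq
            process(payload)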

In the original article, the chart section labelled "loop" doesn't mention timeout handling. That's not good. If you do long polling, you probably need to send something every few seconds to keep the connection alive. It's not clear what a safe interval is.
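A sketch of the kind of keepalive I mean (the interval is a guess, not a known-safe number):

    import queue, time

    HEARTBEAT_SECS = 5  # guess; what's safe depends on the middle boxes

    def long_poll_body(events, deadline):
        """Yield chunks for one long-poll response from a queue.Queue of
        pre-encoded events; emit a heartbeat when the queue stays empty."""
        while time.time() < deadline:
            try:
                yield events.get(timeout=HEARTBEAT_SECS)
            except queue.Empty:
                yield b'{"heartbeat": true}\n'  # client just discards these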



Every problem you just listed is 100% in your control and able to be configured, so the issue isn't long polling, it's your setup/configs. If your client (libcurl) times out a request, set the timeout higher. If Apache is your web server and it disconnects idle clients, increase the timeout, and tell it not to buffer the request and to pass it straight back to the app server. If there's a cloud LB somewhere (it sounds like it, because an ALB defaults to a 10s idle timeout), increase the timeouts...

Every timeout in every hop of the chain is within your control to configure. Set up a subdomain and send long polling requests through that, so the timeouts can be set higher without impacting regular HTTP requests or opening yourself up to a slow-client DDoS.

Why would you try to do long polling and not configure your request chain to be able to handle it without killing idle connections? The problems you have only exist because you're allowing them to exist. Set your idle timeouts higher. Send keepalives more often. Tell your web servers not to do request buffering, etc.
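For Apache that's a handful of directives, e.g. (the directive names are real, the values are only illustrative):

    # Illustrative vhost for a dedicated long-poll subdomain, so the
    # long timeouts don't also apply to regular traffic.
    <VirtualHost *:443>
        ServerName poll.example.com
        Timeout 300             # core: allow long gaps with no network activity
        ProxyTimeout 300        # mod_proxy: how long to wait on the app server
        KeepAliveTimeout 65     # idle gap allowed between requests
        ProxyPass        / http://app-server:8080/
        ProxyPassReverse / http://app-server:8080/
    </VirtualHost>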

All of that is extremely easy to test and verify. Does the request live longer than your polling interval? Yes? Great, you're done! No? Tune some more timeouts and log the request chain everywhere you can until you know where the problems lie. Knock them out one by one, going back to the origin, until you get what you want.

Long polling is easy to get right from an operations perspective.


Whilst it's possible you may be correct, I do have to point out you are, I believe, lecturing John Nagle, known for Nagle's algorithm, used in most TCP stacks in the world.


He has a valid criticism. It's not that it can't be fixed. It's that it's hard to diagnose.

The underlying problems are those of legacy software. People long gone from Linden Lab wrote this part of Second Life. The networking system predates the widespread use of middle boxes. (It also predates the invention of conflict-free replicated data types, which would have helped some higher level consistency problems.) The internal diagnostic tools are not very helpful. The problem manifests itself as errors in the presentation of a virtual world, far from the network layer. What looked like trouble at the higher levels turned out to be, at least partially, trouble at the network layer.

The developer who has to fix this wrote "This made me look into this part of the protocol design and I wish I hadn't."

More than you ever wanted to know about this: [1] That discussion involves developers of four different clients, some of which talk to two different servers.

(All this, by the way, is part of why there are very few big, seamless, high-detail metaverses. Despite all the money spent during the brief metaverse boom, nobody actually shipped a good one. There are some hard technical problems seen nowhere else. Somehow I seem to end up in areas like that.)

[1] https://community.secondlife.com/forums/topic/503010-obscure...


> Whilst it's possible you may be correct, I do have to point out you are, I believe, lecturing John Nagle, known for Nagle's algorithm, used in most TCP stacks in the world.

Thank you for pointing that out. This thread alone is bound to become a meme.


[flagged]


I'm sorry, I didn't mean to embarrass you. I only meant to point out that people would err on the side of John likely knowing what they were talking about, whilst you seem to confidently have some misunderstandings in your comment.

For example, you said "Every timeout in every hop of the chain is within your control to configure", but I'm quite confident that the WiFi router my office uses doesn't respect timeouts correctly, nor does my phone's ISP, or certain VPNs. Those "hops" in the network are not within my control at all.


> I better go kill myself from this embarrassment so my family doesn't have to live with my shame!

There's no need to go to extremes, no matter how embarrassing and notably laughable your comment was. I'd say enjoy your fame.


> There's no need to go to extremes

Agreed, honestly if an argument is made and it makes sense, it doesn't matter who is on the other side.

> no matter how embarrassing and notably laughable your comment was.

I wouldn't even call the original comment laughable, they had a point - if you are in control of significant parts of the overall solution, then you can most likely mitigate most of the issues. And, while not a best practice, there's nothing really preventing you from sneaking in the occasional keepalive type response in the stream of events, if you deem it necessary.

The less of the infrastructure and application you control, the more likely the other issues are to rear their heads. As usual, it depends on your circumstances, and both the original comment and the response are valid in their own right. The "extreme" response was a bit more... I'm not sure, cringe? Oh well.


I bet there's an online college credit transfer program that'll accept this as a doctoral defense. Depending on how Nagle's finagled.


Oh my gosh :-D


[flagged]


But it's easier to lecture someone on a bug they already diagnosed and explained to you.


We must be reading different comments


> Every timeout in every hop of the chain is within your control to configure.

lol


I wasn't talking about network switch hops, and if you're trying to do long polling and don't have control over the web servers going back to your systems, then wtf are you trying to do long polling for anyway?

I don't try to run red lights because I don't have control over the lights on the road.


Thus, the advice to not run the red light.


[flagged]


Re-read the post, there’s more in the path than just your client and server code, and network switches aren’t the problem. The “middle boxes and proxy servers” are legion and you can only mitigate their presence.

You’ve been offered the gift of wisdom. It’d be wise on your part to pay attention, because you clearly aren’t.


That race condition has nothing to do with long polling, it's just poor design. The sender should stick the message in a queue and the client reads from that queue. Perhaps with the client specifying the last id it saw.
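Sketched in Python (names invented; the point is that nothing is dropped until the client's next request acknowledges it):

    import itertools, threading

    class Outbox:
        """Per-client queue of outgoing messages with increasing ids."""
        def __init__(self):
            self._lock = threading.Lock()
            self._ids = itertools.count(1)
            self._pending = []  # [(id, payload), ...]

        def push(self, payload):
            with self._lock:
                self._pending.append((next(self._ids), payload))

        def poll(self, last_id):
            """Called on each long-poll request. Drops everything the
            client already saw, returns whatever is still pending."""
            with self._lock:
                self._pending = [m for m in self._pending if m[0] > last_id]
                return list(self._pending)

A response lost to a timeout is then harmless: the client re-requests with the same last id and gets the same messages again.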


And it's worth noting that you can't just ignore this problem if you're using websockets - websockets disconnect sometimes for a variety of reasons. It may be less frequent than a long-polling timeout, but if you don't have some mechanism of detecting that messages weren't ack'd and retransmitting them the next time the user connects, messages will get lost eventually.


> That race condition has nothing to do with long polling, it's just poor design. The sender should stick the message in a queue and the client reads from that queue.

How does that help? You can't pop from a queue over HTTP because when the client disconnects you don't know whether it saw your response or not.


The next long-polling request can include a list of the ID(s) returned in the previous request. You keep the messages in the queue until you get the next request ack'ing them.


That means you have to keep them in the queue until the next time the client connects, which could be a very long time


To get reliable message delivery, you have to do that when using WebSockets or SSE too, because those also disconnect or time out depending on upstream network conditions outside the client's control, and will lose messages during reconnection if you don't have a sender-side retransmit queue.

However, queued messages usually don't have to be kept for very long. Because every connection method suffers from this problem, you wouldn't usually architect a system with no resync or reset strategy in the client for when reconnection takes so long that it isn't useful to stream every individual message since the last connection.

The client and/or server have a resync timeout, and the server's queue is limited to that timeout, plus margin for various delays.

Once there's a resync strategy implemented, it is often reasonable for the server to be able to force a resync early, so it can flush message queues according to criteria other than a strict timeout, for example memory pressure or server restarts.
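As a sketch of that (the resync signal is whatever your protocol uses for a full state refresh):

    from collections import deque

    class BoundedOutbox:
        """Retransmit queue capped at max_len. On overflow the backlog
        is abandoned and the client is told to resync instead."""
        def __init__(self, max_len=1000):
            self._q = deque()
            self._max = max_len
            self.needs_resync = False

        def push(self, msg):
            if len(self._q) >= self._max:
                self._q.clear()  # backlog is no longer useful
                self.needs_resync = True
            self._q.append(msg)

        def poll(self):
            if self.needs_resync:
                self.needs_resync = False
                self._q.clear()
                return ("resync", None)  # client does a full refresh
            return ("messages", list(self._q))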


With websockets, the client can immediately acknowledge receipt of a message, since the connection is bidirectional. And on a resync the server can just send events that the client never acknowledged.
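A sketch of that loop with the Python websockets package (the id/ack message format is invented):

    import asyncio, json, websockets

    def handle(event):
        ...  # application logic (stub)

    async def consume(url):
        # connect() used as an async iterator reconnects automatically;
        # after each reconnect the server replays unacknowledged events.
        async for ws in websockets.connect(url):
            try:
                async for raw in ws:
                    event = json.loads(raw)
                    handle(event)
                    await ws.send(json.dumps({"ack": event["id"]}))
            except websockets.ConnectionClosed:
                continue

    # asyncio.run(consume("wss://example.com/events"))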


That (to quote you) means you have to keep them in the queue until the next time the client connects, which could be a very long time.

To be clear, there's no real difference - in both cases you have to keep messages in some queue and potentially resend them until they've been acknowledged.


As the other guy said, that's why I mentioned using an ID. And as the other guy said, the same is required regardless of what channel you're using.


I'm new to websockets, please forgive my ignorance — how is sending some "heartbeat" data over long polling different from the ping/pong mechanism in websockets?

I mean, in both cases, it's a TCP connection over (eg) port 443 that's being kept open, right? Intermediaries can't snoop the data if it's SSL, so all they know is "has some data been sent recently?" Why would they kill long-polling sessions after 10sec and not websocket ones?


Sending an idle message periodically might help. But the Apache timeout for persistent HTTPS connections is now only 5 seconds.[1] So you need rather a lot of idle traffic if the server side isn't tolerant of idle connections.

Why such a short timeout? Attackers can open connections and silently drop them to tie up resources. This is why we can't have nice things.

[1] https://httpd.apache.org/docs/2.4/mod/core.html#keepalivetim...


Wow, 5 seconds is pretty aggressive! For example, nginx has 60s as the default, probably allowed by its event-driven architecture, which mitigates some of the problems with "c10k" use cases.

Anyways, the real takeaway is that even if your current solution works now, one day someone will put something stupid between your server and the clients that will invalidate all current assumptions.

For example, I have created a service which consumes a very large NDJSON file over an HTTPS connection that I expect to be open for half an hour at least, so I can process the content as a stream.

I dread the day when I have to fight with someone’s IT to keep this possible.
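For what it's worth, the consuming side is short (a sketch with the requests library; URL and timeouts are placeholders):

    import json, requests

    # Stream a large NDJSON body without buffering it all in memory.
    # The read timeout (300) bounds the gap between chunks, not the
    # total transfer time, so a half-hour stream is fine as long as
    # lines keep arriving.
    with requests.get("https://example.com/big.ndjson",
                      stream=True, timeout=(10, 300)) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # iter_lines() can yield keepalive blanks
                record = json.loads(line)
                ...   # process one record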


> If the server has nothing to send for a while, libcurl times out. The client then makes the request again. This results in a race condition if the server wants to send something between timeout and next request. Messages get lost.

I think it's a premise of reliable long-polling that the server can hold on to messages across the changeover from one client request to the next.


Yeah, some servers close connections when there’s no data transfer. When the backend holds the connection while polling the database until a timeout occurs or the database returns data, it needs to send something back to the client to keep the connection alive. I wonder what could be sent in this case and whether it would require special client-side logic.
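Maybe a filler record the client filters out? The client-side logic would be tiny (sketch; the format is invented):

    def real_events(records):
        """Drop keepalive markers before normal processing."""
        for rec in records:
            if rec.get("heartbeat"):  # filler sent only to hold the
                continue              # connection open
            yield rec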


In the HTTP model, status code 102 Processing would technically fit best, though it's no longer part of the HTTP specification.

https://http.dev/102

100 Continue could be usable as a workaround. It would probably require, at a bare minimum, some extra integration code on the client side.



