We don't use k8s and you don't have to either. This is for current and future users who absolutely want k8s. We are a compute provider, after all, and making it easy to host a great variety of apps is good for our users.


We really didn't expect so many people to read this like "Fly is going all K8s"! It's interesting.


There are a substantial number of frankly ignorant people on HN who see K8s and run for the hills, just because it is something they genuinely do not comprehend or because they were burned by an inappropriate deployment. The irony is that you aren't switching to K8s, you are simply offering it as a compatibility layer, and you're still catching flak. Case in point.


It seems I misunderstood the article. I'm happy to read this in any case, thanks for the follow up!


fly.io employee here, it's basically an adaptor from the Kubernetes world of YAML to the fly.io world of Machines. How could we have framed it better so that it was more clear?


A more in-depth analysis of which parts of the Kubernetes spec are unsupported by this adaptor would be extremely useful in evaluating its viability for any given use case.


If the article had been summarized with a TL;DR or a phrase along the lines of what you mentioned ("it's basically an adaptor from the Kubernetes world of YAML to the fly.io world of Machines"), that would have made it way easier to understand.

In my opinion, the concept is not trivial to grasp on one's own (meaning: it's not easy to arrive at that conclusion even after reading the article). So being explicit up front, and guiding the reader through that concept over the course of the article, would have made a big difference.


There’s a crate for tokio, so it’s not automatic but might still be interesting: https://lib.rs/crates/tokio-splice


There’s a col_version column in a clock table used for last-write-wins. In case of a tie, the “biggest” value wins.
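
If it helps, here's a hypothetical sketch of that merge rule (illustrative types, not cr-sqlite's actual code): the higher col_version wins, and a tie breaks on the bigger value.

    // Hypothetical representation of one column's state in the clock table.
    #[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
    struct Versioned {
        col_version: u64, // counter bumped on every write to this column
        value: Vec<u8>,   // the column's serialized value
    }

    fn merge(local: Versioned, remote: Versioned) -> Versioned {
        // Derived Ord compares col_version first, then value, which is exactly
        // "higher version wins, ties break on the bigger value".
        std::cmp::max(local, remote)
    }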


Oh nice. On closer inspection, it looks like they're using Lamport clocks, which track causation but ignore time, although time is mentioned somewhere as a possibility in hybrid systems someday, if I'm understanding it correctly?

Looks like only a 2MB binary for the extension, so you could in theory just pack it with your app too.

I'm particularly interested because it seems like (for very small databases) you could use SyncThing as the sync backend by just periodically dumping your data to files (and making a new one once the file got too big).

I don't know how you could ever garbage collect the old files aside from some kind of manual "Delete everyone else's stuff and output your own big merged log" command, but it would be really cool to be able to make apps with P2P sync.

It also seems like you could put them on an HTTP server and use them like an RSS feed. Or even serve them via torrents.


We’re using cr-sqlite as part of our distributed state propagation system. It is indeed easy to bundle in the app!

https://github.com/superfly/corrosion

It would be possible to distribute cr-sqlite changes in many different ways (like you said, http or torrents, etc.) since any change can be applied out of order.
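
For the bundling part, here's roughly what loading the extension from Rust can look like with rusqlite (the "./crsqlite" path and the load_extension cargo feature are assumptions about your setup, not necessarily what Corrosion does):

    use rusqlite::{Connection, Result};

    fn open_crdt_db(path: &str) -> Result<Connection> {
        let conn = Connection::open(path)?;
        // Needs rusqlite's "load_extension" feature; unsafe because we're
        // loading native code into the process.
        unsafe {
            conn.load_extension_enable()?;
            conn.load_extension("./crsqlite", None)?; // bundled cr-sqlite dylib
            conn.load_extension_disable()?;
        }
        Ok(conn)
    }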


I'm the one who created this incident on our status page. I've been overly cautious in resolving this incident, but at this point I think it's causing more harm than good to keep it unresolved on there.

I think it might've prevented users from posting on our forums or sending in an email (premium support). I can imagine users looking at the status page and mistakenly thinking their problems were related to the current incident.

I've interpreted "Monitoring" as essentially meaning: "this is fixed, but we're keeping a close eye on the situation". We do not yet have a formal process for incidents such as this one (but we are working on that).

If our users are having issues, that's a problem. Looking at our own metrics, the community forum, and our premium support inbox, I don't believe this to be the case.

Perhaps we should've done a better job at explaining the exact symptoms our users might be experiencing from this particular incident.


I really appreciate the context. We have an SPA with the frontend deployed on vercel and a GraphQL backend hosted on fly. The outage yesterday manifested as 502 errors being delivered to users on the frontend. We had another outage alert at 08:00 PST this morning that lasted about 5-10 minutes. It seemed like the same issue, so we didn't report another incident.

I really like fly, and I think you all are building a great product, but it's looking likely that we're going to migrate off of it. The biggest driver of that has been communication and issues with the status page. Specifically,

- When an incident occurs, we're often among the first to report it on the forum. Over the last month, the status page has lagged pretty significantly behind the incidents. This makes it feel like we're discovering the issue before fly (I don't know if that's true, but that's the perception). Given that our automated tools are alerting us, it's disconcerting to feel like we're keeping a closer eye on our box's health than our cloud provider (again, this is perception based on communication lag, not necessarily reality).

- We have had multiple outages over the last month. In the middle of an outage, while there is an incident banner displayed at the top of the page, all systems show green with 99.98% or 99.99% uptime. That makes us not trust the numbers on the status page. This reinforces the above perception that fly's systems aren't being accurately monitored. Even now, the status page shows 100% uptime for all systems yesterday and today, which is not true.

- We emailed yesterday about our frustrations and concerns - specifically talking about the disconnect between fly's status page and the multiple outages. We explicitly called out the two points above, and how the communication up to this point has been "We've implemented a fix and are monitoring it". We asked for more details about what occurred, and what was being done to mitigate it in the future. The response was pretty boilerplate: "We're sorry you're frustrated. Here are some credits. We've implemented a fix and are monitoring it. Please let us know if you are still encountering issues."

The incidents were a problem, but the disconnect between what was communicated and what occurred, across multiple channels, is what's driving us to leave. Here's what likely would have convinced us to stay:

- Over-communicate during the incident. I'd prefer to see more status updates rather than fewer.

- Have clear, proactive incident notifications. Even with automated monitoring, things will slip through the cracks, but everything over the last month has felt reactive.

- Make sure the status page clearly reflects reality. If the system is down and everything shows green, then I'm 1) frustrated, and 2) wondering what else is slipping through the cracks.

- Publish retro docs or incident reports after an incident. Specifically, report what changes are being made to prevent an outage going forward.

- Train the support staff to communicate directly with developers. Boilerplate emails that focus on empathizing rather than informing are generally frustrating. Especially if they don't actually answer the questions being asked. I get that it's not reasonable to expect a support person to have an in-depth technical conversation, but this is where public incident reports (or live incident pages) can be really helpful.

I think you all are making a great product, but the issues with alerting, monitoring, and communication are too impactful for our production application. I'm confident you'll figure it out, but it's unlikely that we're going to wait.


> I think it's causing more harm than good to keep it unresolved on there.

Sorry, what? You have an open incident that you think should be shown as resolved because not doing so "causes more harm than good"?

Right, so lying to your customers about the state of an incident is better than just telling them the truth?


When you place a fix into production, it's often the case that you hope it resolves all issues and doesn't create new ones.

However, you don't know if it resolved everything because you are only working with the symptoms given by one user.

If another user has a similar but not identical problem, they won't post about it while the incident is still shown as unresolved. They don't know that their case is different and isn't being worked on.


> When you place a fix into production, it's often the case that you hope it resolves all issues and doesn't create new ones.

I hope not. Relying on "hope" when fixing prod is not a recipe for success in my book. It should ideally be possible to recreate the problem in a lesser environment, or at least to get a level of comfort that the fix will work based more on fact than "hope" before applying it.

Even then, if you are relegated to the level of hope and prayer when trying to handle an incident, it still doesn't mean you should close it unless you are *certain* it's fixed.

You can mark it as mitigated or fix applied, monitoring for xx period before marking as resolved or similar, surely.


I wholly agree. From what I see, OP also agrees, since they will now be using stricter criteria that let them close more incidents earlier and only reopen them when it's proven that there are other issues.


This looks great. I might give this a shot for a use case we have. My main concern is that the docs.rs-generated documentation is hard to use. I don't exactly know how things fit together. I'm sure I could figure it out via the tests, but more docs and usage examples would help a lot.

As for your search for SWIM in the Rust ecosystem: I found a pretty good (well documented and tested) crate: https://crates.io/crates/foca


Thanks a lot! You can always file an issue on https://github.com/quickwit-oss/chitchat. We can improve the documentation, add more examples, and, most of all, learn from your use case.


Whoops. It used to. Going to fix that soon!


Looks like it was hidden, not on the main page: https://fly.io/feed.xml


We use containerd with the devmapper snapshotter! Works nicely.

We create a hard link to the resulting device for the root drive inside firecracker.


To be determined. We're hoping to contribute and use what's going to come out of hyper's h3 efforts (we use Rust for our reverse-proxy). There's not much there yet though: https://github.com/hyperium/h3

We're not in a huge hurry to support QUIC / H3 given its current adoption. However, our users' apps will be able to support it once UDP is fully launched, if they want to.


Are you using a custom reverse proxy? For a recent project I started with Caddy but ended up needing some functionality it didn't have, and didn't need most of what it did have. I'm currently using a custom proxy layer, but I'm concerned I might end up having to implement more than I want to (I know I'll at least need gzip). Curious what your experience at fly has been with this.


We are! It's Rust + Hyper. It is a _lot_ of work, but that's because we're trying to proxy arbitrary client traffic to arbitrary backends AND give them geo load balancing.

Writing proxies is fun. Highly recommended.
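
If anyone wants a feel for the starting point (nothing like what we actually run, and assuming hyper 0.14 + tokio with a single hardcoded backend), a minimal pass-through proxy is only a handful of lines:

    use hyper::service::{make_service_fn, service_fn};
    use hyper::{Body, Client, Request, Response, Server};
    use std::net::SocketAddr;

    async fn proxy(
        client: Client<hyper::client::HttpConnector>,
        mut req: Request<Body>,
    ) -> Result<Response<Body>, hyper::Error> {
        // Point the request at the backend. A real proxy picks the backend
        // per request; this is where geo load balancing would slot in.
        let path = req.uri().path_and_query().map(|p| p.as_str()).unwrap_or("/");
        let upstream = format!("http://127.0.0.1:8080{}", path);
        *req.uri_mut() = upstream.parse().expect("valid upstream URI");
        client.request(req).await
    }

    #[tokio::main]
    async fn main() {
        let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
        let client = Client::new();
        let make_svc = make_service_fn(move |_conn| {
            let client = client.clone();
            async move { Ok::<_, hyper::Error>(service_fn(move |req| proxy(client.clone(), req))) }
        });
        Server::bind(&addr).serve(make_svc).await.expect("proxy server failed");
    }

Going from that to something production-grade (connection pooling, retries, TLS termination, arbitrary backends) is where the "a lot of work" part comes in.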


Cool, thanks!

I was actually just playing with Hyper for a few hours last night. Are you guys using async/await yet? Any suggestions for learning materials for async rust other than the standard stuff?


Another stupid question, but I can't help it: golang seems like a popular choice among network developers. Is there any particular reason fly.io chose Rust over golang for the proxy?


Because of JavaScript. Really!

We settled on Rust back when we were building a JS runtime into our proxy. It's a great language for hosting v8. When we realized our customers just wanted to run any ol' code, and not be stuck in a v8 isolate, we extracted the proxy bits and kept them around.

I think you could build our proxy just fine in Go. One really nice thing about Rust, though, is the Hyper HTTP library. It's _much_ more flexible than Go's built-in HTTP package, which turns out to be really useful for our particular product.


What functionality did you need that Caddy didn't have?


Hey Matt! I'm referring to the ability to change the Caddy config from an API that is itself proxied through Caddy. Here's the issue, which you were very helpful in[0].

Ultimately I realized that most of what I needed from Caddy was really just certmagic, which has worked flawlessly since I got it set up. Plus I need the final product to compile into a single binary. Since my custom reverse proxy only took a few lines of code, I haven't worried too much about it. But there are a few features which I'll have to integrate eventually.

If I end up seeing myself headed down the path of making a full-fledged reverse proxy, I'll reconsider trying to implement my project as a Caddy plugin.

[0]: https://github.com/caddyserver/caddy/issues/3754


Their server and/or your client might not be set up for session reuse. Definitely a thing to check.


Thank you. I will make sure this is enabled.


Hey there, Fly co-founder here!

Fly's proxy uses a mix of tokio, hyper and rustls. We don't need to use a crate that handles ACME because we're processing all the validation and certificate authorizations from a centralized, boring, Rails application.

We've had to submit a PR to the rustls project a few months ago to handle different ALPNs. Instead of resolving a certificate only from a SNI, the crate now provides the full ClientHello which contains negotiable ALPNs. With that information you can respond to the tls-alpn-01 challenge.

