Hacker News
We Built Our Own DNS Infrastructure (replit.com)
251 points by amasad on April 29, 2021 | hide | past | favorite | 64 comments


I did exactly this 15 years ago. I was a C++ developer and there was no golang around yet. And Bind had fresh security bugs every month. So the easiest and safest thing I could do was use djbdns (tinydns) with its cdb files being recompiled whenever records had to be added or updated.


CDB was also (by design) a far nicer programming interface than zone files, which are archaic.


Curious to know if anyone's hosting real-world production web apps or APIs on replit, or is it mainly an educational platform?


Some companies host bots, internal tools, and increasingly micro-services. But it's primarily side-projects and, more and more, early-stage startups (e.g. https://blog.replit.com/blubbr).

One fresh thing: plenty of developers in India are hosting COVID support apps on Replit. This particular app has served 1m+ hits in the past 24 hours: https://covid.army/

You can add __repl after any app to get the source: https://covid.army/__repl


This is so cool. I love how Repl.it is making a positive impact by empowering developers.


How is this different from glitch.me? I am genuinely curious. Looks like a static Next.js app. I am a regular replit user, and I want to know whether Node/Deno support is complete (it hasn't been very responsive so far).


It’s better, more robust, faster, and more flexible. Bigger user base and network of creators and apps to draw on.

Node support is A* and deno is still experimental.


Speaking of runtime support, any reason why C# is using Mono instead of official cross-platform .NET Core 3.x or the latest .NET 5, and it also looks like ASP.NET is unsupported. Any plans to address these in the near future?


The article mentioned that the blog itself is hosted on a replit, so I guess that's one example.

This seems like it would be great for rapid prototyping.


It's marketed as an IDE so hopefully not (I'm not sure you even could?)


Our marketing is so embarrassingly out-of-date right now, but yes, we have a very robust hosting offering. It's still meant for side-projects and maybe early-stage startups.

I wrote something on what it feels like to host an app from your editor: https://amasad.me/hosting


Unfortunately my experience of replit (is that the correct spelling now?) is that it doesn't work, even for development. I have the simplest create-react-app project [1], and not only does it take 10min to npm install, it then doesn't start (runs out of memory, gets killed). I know frontend development is in a bit of a weird place right now, but if it doesn't work for this...

I last tried it 8 months ago but I get the same thing now. If it takes seconds on GitHub's free CI and fails after 10min on your platform, then I really can't use it.

    The build failed because the process exited too early. This probably means the system ran out of memory or someone called `kill -9` on the process.
[1]: https://github.com/remram44/twitch-vod-sync


Create React App is a crime against computers. It's an absolute resource hog, which maybe is okay for Facebook engineers with $3000 MacBooks, but we can't give that much resources to free users. To use React on Replit you should use Vite. We were on HN just yesterday talking about it, it's an absolute delight: https://news.ycombinator.com/item?id=26972400

If you really want to use CRA on Replit, you should buy the hacker plan for $7/month which also lets you boost select repls and it will then be very usable for you: https://blog.replit.com/boosts


Given how spiky a dev environment would seem to be from a CPU perspective, can you pack a large number of free customers on a single very large VM to allow those occasional spikes without requiring a huge investment?

Anyone abusing that free cluster, you boot over to a more constrained and throttled environment.

Appreciate the candor and thanks for the blog post. Clearly an environment like this is the future of development.


I lol-ed and also like your frank answer. well done.


I guess my real point is that common workloads that work for everyone don't work on your development environment. I could pay for a bigger environment from you, but really this does not make me want to move any workflow to your platform (especially not production deployments!).

Sure, I can try and sell a whole different tool to my team, or accept that they just want to get things done and keep using the current tool that works everywhere but on replit.

I get where you're coming from, and I might take your advice next time I have to do something frontend-related (which hopefully I won't), but also understand that this works fine on most other tools' free tiers. In fact it builds from scratch and publishes to GitHub pages in 50s in GitHub Actions.


That's a world where megacorps and VC-funded fartups are the only thing left, and they can give you lucrative free tiers because either they're FU big or they're just burning VC money to lock you in and eventually sell your data or extort money from you once they've become dominant in the market.

I get where you're coming from; you just want free stuff. Understandable. But also understand that there are people who want to operate and support different kinds of businesses. For example, we might pay a few dollars a month for someone to host a git repo: https://sourcehut.org/pricing/

Offering service actually costs money, and it has to come from somewhere.


This is $7/mo before I even get to try it on real workloads though. There is no trial period or limited setup in which I can evaluate your system on a real (but simple) workload before grabbing my credit card. Surely you can see how that would turn people away...


Yeah, sometimes you have to shell out.

Anything that can run a "real workload" is these days going to be heavily targeted by buttcoin miners and other abuse. Any free trial is ripe for abuse, and it becomes a cat & mouse game to identify and stop them before they waste all your resources. Is that fight worth fighting for a small business? Probably not. You might lose a few stingy would-be customers, but that may cost less than the fight. I bet most of these small businesses will be happy to refund your $7 if you try their service, find out it's not for you, and explain your situation nicely.


The most performant MacBook on the market costs $1000, just saying (M1 MacBook Air). You can get one with a larger screen if that’s your thing. But $1000 gets you more than enough to run create react app.

You really shouldn’t slag off people who are working hard to give you something for free. And if you are, at least get your facts straight.


Marketing and positioning definitely need a push towards robust and scalable hosting.

Integrated environments (browser IDE + CI/CD + serverless hosting) are going to be big for starter apps and quick APIs.

I spent some time reviewing Google Cloud Shell Editor, which has git support (not GitHub yet I guess) and one-click build & deployment to Google Cloud Run, which overall is a powerful integrated environment in itself.

Imo, replit should preemptively focus on improving and highlighting hosting.


You're right -- thanks for the feedback.


Very cool! So, would it be achievable to use Repl both to host and to create posts (e.g. as a web/code-based CMS) for a Jekyll (or some other SSG) blog, and also have the content be git-integrated? Would I just use a guide like this?: https://replit.com/talk/learn/GUIDE-MAKE-A-BLOG-USING-JEKYLL...


I wonder about the trade-offs in writing a completely new server compared to a backend for BIND or PowerDNS.


Straightforward authority servers like these are very easy to write in memory-safe languages. It might actually take longer to figure out how to get a BIND configuration doing what Replit does here, but even if it didn't, the resulting server is much safer than BIND, and does exactly and only what you want it to do.

I think: a pretty easy engineering decision.


I don't disagree overall that a competent team really could get this right in the same time it'd take to make BIND work how you wanted. But although DNS is simple, there are some corner cases where an incorrect implementation will appear to work in trivial scenarios ("It worked in my Chrome") yet still have functional or security problems.

For example the 0x20 trick. The specification for DNS is clear that you aren't supposed to care about bit 0x20 in labels. ClOWnS and cLowNs and clowns and CLOWNS are all the same label as far as DNS is concerned.

However your answers need to bit-for-bit match the question you were asked. So if you answer "ClOWns A?" with "CLOWNS A 10.20.30.40" that's a mistake, you were asked about "ClOWns" not "CLOWNS". In 1995 if your DNS server got this wrong nothing of consequence breaks. But in 2021 if you get this wrong some important things magically don't work.

The transaction IDs that should make forging DNS answers hard are very short, so to beef that up slightly some stacks hide extra random bits in the 0x20 bit of labels, where they will be echoed back by a compliant implementation. But to reap this reward they must ignore answers that get the 0x20 bits wrong, like yours.
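The rule the parent describes can be condensed to: match labels case-insensitively, but answer with the exact bytes you were asked about. Here is a minimal stdlib-only sketch (the helper names are mine, not from any real server or library) showing both halves:

```go
package main

import "strings"

// equalLabel compares two DNS names case-insensitively, as RFC 1035
// requires: the 0x20 bit of ASCII letters is ignored when matching.
func equalLabel(a, b string) bool {
	return strings.EqualFold(a, b)
}

// buildAnswerName returns the owner name to put in the answer section.
// To stay compatible with 0x20-randomizing resolvers it must be the
// queried name byte-for-byte, NOT the canonical form from zone data.
func buildAnswerName(queried, canonical string) string {
	if !equalLabel(queried, canonical) {
		return "" // not our name; a real server would signal NXDOMAIN/REFUSED
	}
	return queried // echo the question's exact casing
}
```

So a query for "ClOWns" is found under "clowns" in the data, but the answer's owner name stays "ClOWns".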

I feel like if "You can't get this wrong" (a stronger claim, which admittedly you didn't make) were true, my visits to the Let's Encrypt community site wouldn't all begin by ignoring the people whose problem is obviously just that their DNS server doesn't work properly. Some of them have problems an authority server doesn't care about, but lots of them have dumb problems you'd imagine are impossible, and yet apparently people have successfully sold commercial DNS servers with those problems.


Your comment suggests that the answer records in a DNS response need to be bit-for-bit identical to the original query. But the Vixie 0x20 draft says only that the question section in the response needs to be identical for this trick to work --- which is the ordinary way you'd implement an authority server (answers might come from a database or whatnot, but in both miekg/dns and the Rust NLNet library, the natural way to formulate a response simply copies the original query record).

At any rate: the possibility of breaking a 0x20-enforcing resolver scares me a lot less than depending on BIND, whose last memory corruption vulnerability was announced (checks notes) yesterday.


How is BIND still so bad 20 years after everyone already knew it was so bad :(.


You should rewrite it in rust


Bit by bit, that's what everyone is doing.


For anyone curious about 0x20:

* https://tools.ietf.org/html/draft-vixie-dnsext-dns0x20

I'm not surprised that Vixie is involved. :)


The simple solution to this is to return the same binary query as was received from the request.

It’s easy to screw up, definitely agree, but fairly clear how to fix when it’s pointed out. (I made that mistake)


If you don't mind, what things would magically break if the case of the answer does not match? Comparisons should be case-insensitive anyway.

Also, since DNS labels are strictly ASCII (this is why Punycode exists), why wouldn't converting them all to a canonical upper case be a good idea?


There's a lot of daft DNS code out there that makes strange assumptions. My favorite example isn't quite relevant here but I'll mention it anyway because I think it paints the picture well: PowerDNS purportedly added compression support in its responses because a common stub resolver required answer owner names to be a compression pointer to the question.

EDIT: Found a reference:

"Turns out some customers were using a CPE router that thought ‘C0 0C’ was some kind of ‘answer starts here’ marker. And if it did not find that marker, its DNS component would crash. And at that time, PowerDNS did not compress that first response record, so there was no ‘C0 0C’."

-- https://berthub.eu/articles/posts/history-of-powerdns-2003-2...
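For anyone wondering why the marker is exactly "C0 0C": a compression pointer is two bytes with the top two bits set, and the remaining 14 bits are an offset from the start of the message. The question name begins right after the fixed 12-byte header, so a pointer back to it is 0xC0 0x0C. A tiny sketch (my own helper, not from any DNS library):

```go
package main

// compressionPointer encodes an RFC 1035 name-compression pointer:
// the top two bits are 11, the remaining 14 bits hold the offset of
// the earlier occurrence of the name within the DNS message.
func compressionPointer(offset int) [2]byte {
	return [2]byte{0xC0 | byte(offset>>8), byte(offset & 0xFF)}
}
```

With offset 12 (the question name, right after the header) this yields the famous C0 0C bytes that the broken CPE in the anecdote was grepping for.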


The random mixed-case patterns are used to reduce the probability that someone evil could send spoofed DNS replies to your queries and have your DNS resolver trust those fake replies.


But if an adversary can send a spoofed reply with the desired domain name at all, I expect that the adversary could read the original request packet, too?


They're racing. The adversary needs to have their spoofed reply arrive first so you'll accept it as genuine. They will most often seek to arrange to reply to a query they guess you've asked, such that their answer arrives after it's asked but before you receive the honest answer.

This is why that transaction ID matters, the honest answer will copy the transaction ID verbatim from your question in the answer, so you get to pick it at random (back when I was a child it might just be a sequential counter) and your adversary has to guess it. But, alas the ID isn't very wide, so they really do have a good chance to just guess it. Hence, let's hide more random bits elsewhere in our queries to get a better chance of foiling the adversary.

How does an adversary guess what you're asking? Well, for one thing they might have chosen the question you're about to ask. When a bad guy's web site has an image at the top with <IMG SRC="http://real.website.example/header.jpg"> doesn't your web browser try to look up real.website.example to go get the image? Very predictable.


No, you can do it blindly. For example, if you have a web page that has an <img> tag pointing to a specific domain, then you can assume the client will perform a DNS lookup, so you can send fake replies blindly, hoping it will match a lookup request that you don't see.

Adding mixed-case matching makes it harder to make a lucky guess when sending a fake reply blindly.
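A back-of-envelope way to see the benefit: a blind spoofer must match the 16-bit transaction ID, and with 0x20 randomization also the case of every ASCII letter in the queried name. Assuming one forged packet per query and ignoring source-port randomization (which adds further bits in practice), the odds look like this hypothetical calculation:

```go
package main

import "math"

// spoofGuessProbability estimates a blind attacker's chance that a
// single forged reply is accepted: 16 bits of transaction ID plus,
// with 0x20 randomization, one extra bit per ASCII letter in the name.
func spoofGuessProbability(letterCount int) float64 {
	return 1.0 / math.Pow(2, float64(16+letterCount))
}
```

For a name like real.website.example (17 letters) that takes the guess space from 1 in 65,536 to roughly 1 in 8.6 billion per forged packet.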


To add on to this, DNS is meant to be a simple protocol. The problems we encounter with it today are usually due to the thousands of little patches added on through the decades to tack functionality onto what is essentially supposed to be a key-value store.


Hi, author here! tptacek is right on the money. The authority server is really small and simple, and writing it in Go meant we had access to our existing internal packages with the logic to fetch the data we need for each DNS query. This seemed like the most straightforward path.


Hi, PowerDNS dev here! Would you mind fixing your handling of NODATA answers, which are currently lacking the AA bit and a SOA record in authority? It does not seem to affect all instances, so perhaps it only happens on the legacy infra. You can get the details here, for example: https://dnsviz.net/d/b.b.b.b.b.b.b.b.a.a.a.a.a.a.nope.repl.c...


Nice catch! I'll get that fixed up. Thanks :)


miekg/dns is really excellent. Our custom DNS server is written in Rust (with the NLNet libraries, which are also great), but I used miekg/dns to throw together a DNS telemetry system that we use to keep metrics on our DNS (and UDP) service from around the world using off-net hosts, which sounds cool to type out but was an absurdly simple coding project because of how good the libraries are.

More people should do cool weird stuff with DNS. (And Replit should host stuff on Fly! But also the DNS stuff we're talking about.)

It's a great post, thanks for writing it.


> And Replit should host stuff on Fly! But also the DNS stuff we're talking about.

Having used both, I'd wager replit competes with fly.


That wasn't a dig!


We had some success with CoreDNS (over BIND et al): https://coredns.io/manual/toc/

Switched later to OctoDNS, mostly because we didn't want to run DNS infrastructure or deal with racing updates to records: https://github.com/octodns/octodns


At work we have a service built on the PowerDNS HTTP remote backend. The biggest pain there was the lack of docs. There are circumstances where PowerDNS makes multiple requests to the backend app server, and figuring out how to answer its first queries in a way that satisfied it took some trial and error (this was a few years ago, so I don't remember specifics). I seem to remember it asking the backend for ANY records and then a bunch of metadata, which meant spending time understanding not just how to answer the simple DNS TXT and A queries we were trying to serve, but how PowerDNS would interpret the client query, pass it to the backend, and interpret the result (which was quite different from straight DNS).

That also made it really hard to test, short of setting up a full end-to-end integration test including running powerdns (which I think we have, but isn't fully automated).

If I was building that service again, I'd definitely have a serious look into building it directly as a standalone DNS server rather than a backend for something else.


Hello! PowerDNS developer here. Did you spot https://doc.powerdns.com/authoritative/appendices/internals.... and https://doc.powerdns.com/authoritative/appendices/backend-wr... ?

And if not (or also if you did), can you suggest documentation additions that would have helped you here?


I think that page would have been incredibly useful, but doesn't look like it existed.

I was just looking through the source history of the project a bit as I wasn't the original developer (I've just done some fixes; most recently fixing some things going from 4.2 to 4.4). The original code was written in 2016, when the documentation was a lot more sparse [1]. Kudos for the improvements!

I just filed a PR [2] fixing the doc issues I ran into (probably should have done that at the time).

To enumerate some things I see right now (using remote http backend):

* pdns always queries /lookup/example.com./SOA, /getAllDomainMetadata/example.com. /lookup/example.com./ANY -- in my case that seems a bit wasteful: there's no metadata, and SOA is returned in the /ANY request anyway. I'm unclear if there's a misconfiguration, something we're doing wrong, or this is just "how it works".

* If I query `some.domain.example.com.` (and we aren't serving requests for any part of that), the backend returns `{result:false}` -- and then pdns proceeds to query `domain.example.com.`, `example.com.`, `com.`, and `.` before finally returning SERVFAIL. It would be nice if (1) it didn't have to run so many queries, but also (2) my understanding is the appropriate response should be REFUSED (as this is not a general DNS resolver) -- but I don't know how to get PowerDNS to do that.

That last bit is what I've found most frustrating about working with pdns: everything feels like "just do what pdns asks and it'll sort out what it wants to do as a result" and I can't just tell it "return REFUSED for that query". Maybe some examples would help with this?

[1] https://web.archive.org/web/20170615183449/https://doc.power...

[2] https://github.com/PowerDNS/pdns/pull/10345


Hey, thanks for the PR, love that!

The wasteful SOA query is a result of our internal code flow. You can reduce wasteful queries a bit with the new `consistent-backends` setting, but the SOA query will remain.

We (no longer) have a knob to disable metadata, but there's a cache that should also remember your empty response (the default value for `domain-metadata-cache-ttl` is 60 seconds).

If you have a SOA for example.com, pdns will never return REFUSED for anything inside example.com. The zone is there, so an answer must be present - either a name with records, a name with no records for that type, or the name does not exist. (The distinction between the latter two is why you mostly see ANY queries instead of the type the client asked for).

As far as I can tell, you should never be returning 'false' to anything, as that indicates failure, which is different than 'I know I do not have what you are asking for'. Think of it like this: your SQL database does not go 'oh no' when it has zero rows for you; it just gives you zero rows. A pdns remote backend should behave the same. If you return 'empty' for those lookups into domains you have nothing for, instead of false, I expect REFUSED will come out instead of SERVFAIL.
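If I've read that correctly, the remote backend's lookup response distinguishes failure ("result": false, which pdns turns into SERVFAIL) from a clean empty answer ("result": []). A minimal sketch of building the two JSON bodies, assuming that reading of the protocol (the helper is hypothetical, not part of pdns):

```go
package main

import "encoding/json"

// lookupResponse builds the JSON body a PowerDNS remote HTTP backend
// returns for a lookup. false signals backend failure (SERVFAIL);
// an empty result list signals "I have no records for this name",
// letting pdns produce a proper negative answer instead.
func lookupResponse(records []map[string]string, failed bool) string {
	var body map[string]interface{}
	if failed {
		body = map[string]interface{}{"result": false}
	} else {
		if records == nil {
			records = []map[string]string{} // zero rows, not an error
		}
		body = map[string]interface{}{"result": records}
	}
	out, _ := json.Marshal(body)
	return string(out)
}
```

The SQL analogy from the parent maps directly: zero rows is `{"result":[]}`, a broken connection is `{"result":false}`.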

(And, if I'm correct in the previous paragraph, your PR is correct too :-) )


Unbound is pretty popular with the folks I work with, but in this case, they are only serving up things they are authoritative for, so they don't need much.


Isn't unbound a resolver, and not an authoritative server (nsd)?


Ah, yep, right. Argh, too late to edit/delete.


A good case of situational pragmatic wisdom > common wisdom.


A good case of situational psychosis widow > common widow.


Interesting! How do you handle apex custom domains with registrars that require an A record?


Hey anurag, big fan of Render! We currently don't have a great solution for this and we recommend that users use a DNS provider that has support for something like an ALIAS record.


> Hey anurag, big fan of Render!

Thanks Connor!

Did you consider creating CNAME records with your existing DNS provider to point to the target cluster proxy for each repl.co subdomain?


Good question! One of the motivations for splitting our infrastructure into clusters was to separate paying and non-paying customers. There are various reasons for this: reducing the size of our failure domains and running better hardware for paying customers. However, this means users can be transitioned between clusters as they become a paying customer or revert to a non-paying one. Since repl.co subdomains are tied to the user/repl and not to a particular back-end cluster, we'd have to maintain far too many records with our existing DNS provider. Even if we could, we'd need to keep the records synchronized. In the end, it made more sense to roll our own authority server that can just query our data store and generate the right DNS answers on the fly.


Makes sense. Thanks for the article and explanation.


> We automatically detect the web server and open a webview in the workspace

And you have to use one of their package buttons to make that happen. Super frustrating that they don’t just have a button to open a web view on a port.


I'm not sure what you're referring to. You can use any web server library/framework you'd like, we open the web view whenever your program starts listening on a port. We can't open the web view any earlier since there would be nothing to load in the web view.


I have been working on this for a bit as well but haven't figured it out completely yet (mostly due to lack of time): I want to build enough to have load balancing/failover at the DNS level (like Cloudflare/Route 53). Any tips? Open source and on-premise hosted.


The usual implementation is to check every few minutes which nodes are up, and then in your A response return only the working addresses. Are you trying to do something more complex?
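That approach boils down to a health-check filter in front of your record set. A minimal sketch of the idea (names and the probe function are illustrative stand-ins, not any particular project's API):

```go
package main

// healthyAddresses implements simple DNS-level failover: probe each
// backend and answer A queries with only the nodes that passed their
// last check. probe stands in for a real periodic TCP/HTTP check
// whose result would normally be cached between DNS queries.
func healthyAddresses(addrs []string, probe func(string) bool) []string {
	up := make([]string, 0, len(addrs))
	for _, a := range addrs {
		if probe(a) {
			up = append(up, a)
		}
	}
	return up
}
```

Weighted or latency-based routing (what Route 53 adds on top) is then just a different selection policy over the same healthy set.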



