I wonder about the trade-offs in writing a completely new server compared to a b...

tptacek · on April 30, 2021

Straightforward authority servers like these are very easy to write, in memory-safe languages. It might actually take longer to figure out how to get a BIND configuration doing what Replit does here, but even if it didn't, the resulting server is much safer than BIND, and does exactly and only what you want it to do.

I think: a pretty easy engineering decision.

tialaramex · on April 30, 2021

I don't disagree overall that a competent team really could get this right in the same time it'd take to make it work how you wanted with BIND. But although DNS is simple, there are some corner cases where an incorrect solution will appear to work in trivial scenarios ("It worked in my Chrome") but may have either functional problems or security problems.

For example the 0x20 trick. The specification for DNS is clear that you aren't supposed to care about bit 0x20 in labels. ClOWnS and cLowNs and clowns and CLOWNS are all the same label as far as DNS is concerned.

However your answers need to bit-for-bit match the question you were asked. So if you answer "ClOWns A?" with "CLOWNS A 10.20.30.40" that's a mistake, you were asked about "ClOWns" not "CLOWNS". In 1995, if your DNS server got this wrong, nothing of consequence broke. But in 2021, if you get this wrong, some important things magically don't work.

The transaction IDs that should make forging DNS answers hard are very short, and so to beef that up slightly some stacks will hide more bits in the 0x20 bit of labels where they will be echoed back by a compliant implementation. But to reap this reward they must ignore answers that get the 0x20 bits wrong, like yours.
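A minimal sketch of the resolver side of that trick, for illustration only. The function names are made up here, and real resolvers work on wire-format labels rather than Python strings:

```python
import random

# Assumption: a real resolver needs an unpredictable source, so default to the OS RNG.
_SYSTEM_RNG = random.SystemRandom()


def encode_0x20(name, rng=_SYSTEM_RNG):
    """Randomly flip the case (the 0x20 bit) of every ASCII letter in a name."""
    out = []
    for c in name:
        if c.isascii() and c.isalpha():
            out.append(c.upper() if rng.getrandbits(1) else c.lower())
        else:
            out.append(c)  # digits, dots and hyphens carry no case bit
    return "".join(out)


def same_dns_name(a, b):
    """DNS's own notion of equality: ASCII case is ignored."""
    return a.lower() == b.lower()


def echo_is_exact(sent, echoed):
    """The stricter 0x20 check: the echoed question must match case for case."""
    return sent == echoed
```

With these, `same_dns_name("ClOWns", "CLOWNS")` is true but `echo_is_exact("ClOWns", "CLOWNS")` is false, which is exactly the reply a 0x20-enforcing resolver throws away.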

I feel like if "You can't get this wrong" (a stronger claim that you admittedly didn't make) was true, my visits to the Let's Encrypt community site wouldn't all begin by ignoring the people whose problem is obviously just that their DNS server doesn't work properly. Some of them have problems an authority server doesn't care about, but lots of them have dumb problems you'd imagine are impossible and yet apparently people have successfully sold commercial DNS servers with those problems.

tptacek · on April 30, 2021

Your comment suggests that the answer records in a DNS response need to be bit-for-bit identical to the original query. But the Vixie 0x20 draft says only that the question section in the response needs to be identical for this trick to work --- which is the ordinary way you'd implement an authority server (answers might come from a database or whatnot, but in both miekg/dns and the Rust NLNet library, the natural way to formulate a response simply copies the original query record).

At any rate: the possibility of breaking a 0x20-enforcing resolver scares me a lot less than depending on BIND, whose last memory corruption vulnerability was announced (checks notes) yesterday.

saurik · on April 30, 2021

How is BIND still so bad 20 years after everyone already knew it was so bad :(.

lawnchair_larry · on May 1, 2021

You should rewrite it in rust

tptacek · on May 1, 2021

Bit by bit, that's what everyone is doing.

throw0101a · on April 30, 2021

For anyone curious about 0x20:

* https://tools.ietf.org/html/draft-vixie-dnsext-dns0x20

I'm not surprised that Vixie is involved. :)

bluejekyll · on April 30, 2021

The simple solution to this is to return the same binary query as was received from the request.

It’s easy to screw up, definitely agree, but fairly clear how to fix when it’s pointed out. (I made that mistake)
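Roughly what "return the same binary query" can look like on the wire, as a hedged sketch using only the standard library. `response_skeleton` is a hypothetical helper, it assumes exactly one question, and a real server would still append answer or authority records and adjust the counts:

```python
import struct

HEADER_LEN = 12  # fixed size of the DNS message header


def question_end(packet):
    """Return the offset just past the first question (name, QTYPE, QCLASS)."""
    pos = HEADER_LEN
    while True:
        length = packet[pos]
        if length == 0:          # root label terminates the name
            pos += 1
            break
        if length & 0xC0:        # compression pointers are not expected in a query
            raise ValueError("unexpected compression pointer in question")
        pos += 1 + length        # skip the length byte and the label itself
    return pos + 4               # QTYPE and QCLASS are two bytes each


def response_skeleton(query):
    """Build a response header and copy the question bytes verbatim."""
    qid, flags, qdcount = struct.unpack("!HHH", query[:6])
    if qdcount != 1:
        raise ValueError("sketch only handles a single question")
    end = question_end(query)
    if end > len(query):
        raise ValueError("truncated question")
    # QR (response) + AA (authoritative), and copy the RD bit from the query.
    resp_flags = 0x8000 | 0x0400 | (flags & 0x0100)
    header = struct.pack("!HHHHHH", qid, resp_flags, 1, 0, 0, 0)
    return header + query[HEADER_LEN:end]  # untouched bytes, case preserved
```

Because the question is sliced out of the request rather than re-encoded from a parsed name, whatever mixed case the client sent comes back unchanged.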

nine_k · on April 30, 2021

If you don't mind, what things would magically break if the case of the answer does not match? Comparisons should be case-insensitive anyway.

Also, since DNS labels are strictly ASCII (this is why punycode exists), why wouldn't converting all of them to a canonical upper case be a good idea?

octomelon · on April 30, 2021

There's a lot of daft DNS code out there that makes strange assumptions. My favorite example isn't quite relevant here but I'll mention it anyway because I think it paints the picture well: PowerDNS purportedly added compression support in its responses because a common stub resolver required answer owner names to be a compression pointer to the question.

EDIT: Found a reference:

"Turns out some customers were using a CPE router that thought ‘C0 0C’ was some kind of ‘answer starts here’ marker. And if it did not find that marker, its DNS component would crash. And at that time, PowerDNS did not compress that first response record, so there was no ‘C0 0C’."

-- https://berthub.eu/articles/posts/history-of-powerdns-2003-2...
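For context on why those two bytes show up so often: in DNS name compression, a two-byte value whose top two bits are set is a pointer, and the remaining 14 bits are an offset from the start of the message. A small sketch (the helper name is made up here) decoding it:

```python
HEADER_LEN = 12  # the DNS header is always 12 bytes, so the question name starts at offset 12


def pointer_target(two_bytes):
    """Decode a DNS compression pointer and return the offset it refers to."""
    value = int.from_bytes(two_bytes, "big")
    if value & 0xC000 != 0xC000:
        raise ValueError("top two bits not set: not a compression pointer")
    return value & 0x3FFF  # low 14 bits are the offset into the message


# C0 0C points at offset 12, i.e. the name in the question section.
assert pointer_target(b"\xc0\x0c") == HEADER_LEN
```

So a compressed first answer whose owner name equals the question name naturally begins with `C0 0C`, which is presumably what that router had latched onto.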

0x0 · on April 30, 2021

The random mixed-case patterns are used to reduce the probability that someone evil could send spoofed DNS replies to your queries and have your DNS resolver trust those fake replies.

nine_k · on April 30, 2021

But if an adversary can send a spoofed reply with the desired domain name at all, I expect that the adversary could read the original request packet, too?

tialaramex · on April 30, 2021

They're racing. The adversary needs to have their spoofed reply arrive first so you'll accept it as genuine. They will most often seek to arrange to reply to a query they guess you've asked, such that their answer arrives after it's asked but before you receive the honest answer.

This is why that transaction ID matters, the honest answer will copy the transaction ID verbatim from your question in the answer, so you get to pick it at random (back when I was a child it might just be a sequential counter) and your adversary has to guess it. But, alas the ID isn't very wide, so they really do have a good chance to just guess it. Hence, let's hide more random bits elsewhere in our queries to get a better chance of foiling the adversary.
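A back-of-the-envelope sketch of how much the extra bits help. It counts only the 16-bit transaction ID plus one bit per ASCII letter in the name, and deliberately ignores other defenses such as source port randomization; the function name is illustrative:

```python
def blind_guess_space(name, id_bits=16):
    """Number of equally likely (ID, case pattern) combinations an off-path attacker faces."""
    letters = sum(1 for c in name if c.isascii() and c.isalpha())
    return 2 ** (id_bits + letters)


# Transaction ID alone: 2**16 possibilities.
assert blind_guess_space("") == 65536
# "real.website.example" has 18 letters, so 0x20 adds 18 more bits.
assert blind_guess_space("real.website.example") == 2 ** 34
```

Short names with few letters gain little, which is one reason 0x20 is usually described as a modest hardening measure rather than a fix.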

How does an adversary guess what you're asking? Well, for one thing they might have chosen the question you're about to ask. When a bad guy's web site has an image at the top with <IMG SRC="http://real.website.example/header.jpg"> doesn't your web browser try to look up real.website.example to go get the image? Very predictable.

0x0 · on April 30, 2021

No, you can do it blindly. For example, if you have a web page that has an <img> tag pointing to a specific domain, then you can assume the client will perform a DNS lookup, so you can send fake replies blindly, hoping it will match a lookup request that you don't see.

Adding mixed case matching makes it more difficult to make a lucky guess when sending a fake reply blindly.

nexuist · on April 30, 2021

To add on to this, DNS is meant to be a simple protocol. The problems we encounter with it today are usually due to the thousands of little patches added on through the decades to tack functionality onto what is essentially supposed to be a key-value store.

cbrewster · on April 30, 2021

Hi, author here! tptacek is right on the money. The authority server is really small and simple, and writing it in Go meant we had access to our existing internal packages that had the logic to fetch the data we need for each DNS query. This seemed like the most straight-forward path.

rgacogne · on April 30, 2021

Hi, PowerDNS dev here! Would you mind fixing your handling of NODATA answers, which are currently lacking the AA bit and a SOA record in authority? It does not seem to affect all instances, so perhaps it only happens on the legacy infra. You can get the details here, for example: https://dnsviz.net/d/b.b.b.b.b.b.b.b.a.a.a.a.a.a.nope.repl.c...
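A rough header-level check for the two problems named above, as a sketch. It only inspects the 12-byte header, so it can tell that the authority section is empty but not whether the record in it is actually a SOA; the function name is hypothetical:

```python
import struct

QR = 0x8000          # response flag
AA = 0x0400          # authoritative answer flag
RCODE_MASK = 0x000F  # low four bits of the flags word


def nodata_header_problems(packet):
    """Return a list of reasons a packet does not look like a proper NODATA answer."""
    _id, flags, _qd, ancount, nscount, _ar = struct.unpack("!HHHHHH", packet[:12])
    problems = []
    if not flags & QR:
        problems.append("not a response")
    if flags & RCODE_MASK:
        problems.append("rcode is not NOERROR")
    if ancount:
        problems.append("has answers, so not NODATA")
    if not flags & AA:
        problems.append("AA bit not set")
    if nscount == 0:
        problems.append("authority section is empty (expected SOA)")
    return problems
```

An empty list means the header is at least shaped like a NODATA answer; a tool like DNSViz does the full validation.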

cbrewster · on April 30, 2021

Nice catch! I'll get that fixed up. Thanks :)

tptacek · on April 30, 2021

miekg/dns is really excellent. Our custom DNS server is written in Rust (with the NLNet libraries, which are also great), but I used miekg/dns to throw together a DNS telemetry system that we use to keep metrics on our DNS (and UDP) service from around the world using off-net hosts, which sounds cool to type out but was an absurdly simple coding project because of how good the libraries are.

More people should do cool weird stuff with DNS. (And Replit should host stuff on Fly! But also the DNS stuff we're talking about.)

It's a great post, thanks for writing it.

ignoramous · on April 30, 2021

> And Replit should host stuff on Fly! But also the DNS stuff we're talking about.

Having used both, I'd wager Replit competes with Fly.

tptacek · on April 30, 2021

That wasn't a dig!

ignoramous · on April 30, 2021

We had some success with CoreDNS (over BIND et al): https://coredns.io/manual/toc/

Switched later to OctoDNS, mostly because we didn't want to run DNS infrastructure or deal with racing updates to records: https://github.com/octodns/octodns

gregmac · on April 30, 2021

At work we have a service built on powerdns-backend (http). The biggest pain there was lack of docs. There are circumstances where powerdns would make multiple requests to the backend app server, and figuring out how to answer its first direct queries in a way to satisfy it on the first query took some trial and error (this was a few years ago, so I don't remember specifics). I seem to remember it asking the backend for ANY records and then a bunch of metadata stuff, which meant having to spend time understanding not just how to answer the simple DNS TXT and A queries we were trying to do, but understanding how powerdns would interpret the client query, pass it to the backend, and interpret the result (which was quite different from straight DNS).

That also made it really hard to test, short of setting up a full end-to-end integration test including running powerdns (which I think we have, but isn't fully automated).

If I was building that service again, I'd definitely have a serious look into building it directly as a standalone DNS server rather than a backend for something else.

Habbie · on April 30, 2021

Hello! PowerDNS developer here. Did you spot https://doc.powerdns.com/authoritative/appendices/internals.... and https://doc.powerdns.com/authoritative/appendices/backend-wr... ?

And if not (or also if you did), can you suggest documentation additions that would have helped you here?

gregmac · on April 30, 2021

I think that page would have been incredibly useful, but doesn't look like it existed.

I was just looking through the source history of the project a bit as I wasn't the original developer (I've just done some fixes; most recently fixing some things going from 4.2 to 4.4). The original code was written in 2016, when the documentation was a lot more sparse [1]. Kudos for the improvements!

I just filed a PR [2] fixing the doc issues I ran into (probably should have done that at the time).

To enumerate some things I see right now (using remote http backend):

* pdns always queries `/lookup/example.com./SOA`, `/getAllDomainMetadata/example.com.`, and `/lookup/example.com./ANY` -- in my case that seems a bit wasteful: there's no metadata, and SOA is returned in the /ANY request anyway. I'm unclear if there's a misconfiguration, something we're doing wrong, or this is just "how it works".

* If I query `some.domain.example.com.` (and we aren't serving requests for any part of that), the backend returns `{result:false}` -- and then pdns proceeds to query `domain.example.com.`, `example.com.`, `com.`, and `.` before finally returning SERVFAIL. It would be nice if (1) it didn't have to run so many queries, but also (2) my understanding is the appropriate response should be REFUSED (as this is not a general DNS resolver) -- but I don't know how to get PowerDNS to do that.
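To make the sequence in the second bullet concrete, here is a small sketch that lists a name and its successively shorter parents down to the root. It only reproduces the order of names described above; it is not PowerDNS's actual lookup code, and the function name is invented:

```python
def name_and_parents(name):
    """List a fully qualified name followed by each parent, ending at the root '.'."""
    if name == ".":
        return ["."]
    labels = name.rstrip(".").split(".")
    # Drop one leading label at a time: a.b.c. -> b.c. -> c.
    names = [".".join(labels[i:]) + "." for i in range(len(labels))]
    names.append(".")  # finally the root zone
    return names
```

For `some.domain.example.com.` this yields the name itself, then `domain.example.com.`, `example.com.`, `com.`, and `.`, i.e. one backend round trip per entry when nothing matches.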

That last bit is what I've found most frustrating about working with pdns: everything feels like "just do what pdns asks and it'll sort out what it wants to do as a result" and I can't just tell it "return REFUSED for that query". Maybe some examples would help with this?

[1] https://web.archive.org/web/20170615183449/https://doc.power...

[2] https://github.com/PowerDNS/pdns/pull/10345

Habbie · on May 1, 2021

Hey, thanks for the PR, love that!

The wasteful SOA query is a result of our internal code flow. You can reduce wasteful queries a bit with the new `consistent-backends` setting, but the SOA query will remain.

We (no longer) have a knob to disable metadata, but there's a cache that should also remember your empty response (the default value for `domain-metadata-cache-ttl` is 60 seconds).

If you have a SOA for example.com, pdns will never return REFUSED for anything inside example.com. The zone is there, so an answer must be present - either a name with records, a name with no records for that type, or the name does not exist. (The distinction between the latter two is why you mostly see ANY queries instead of the type the client asked for).

As far as I can tell, you should never be returning 'false' to anything, as that indicates failure, which is different than 'I know I do not have what you are asking for'. Think of it like this: your SQL database does not go 'oh no' when it has zero rows for you; it just gives you zero rows. A pdns remote backend should behave the same. If you return 'empty' for those lookups into domains you have nothing for, instead of false, I expect REFUSED will come out instead of SERVFAIL.

(And, if I'm correct in the previous paragraph, your PR is correct too :-) )
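A sketch of the "zero rows, not failure" advice for a remote backend lookup handler. Assumptions to check against the PowerDNS remote backend docs: that "empty" means an empty `result` list, and that records use the `qname`/`qtype`/`content`/`ttl` field names shown. The `RECORDS` table and `lookup` function are made up for illustration:

```python
import json

# Hypothetical in-memory data; a real backend would query its own store.
RECORDS = [
    {"qname": "www.example.com.", "qtype": "A", "content": "192.0.2.1", "ttl": 60},
    {"qname": "www.example.com.", "qtype": "TXT", "content": "\"hello\"", "ttl": 60},
]


def lookup(qname, qtype):
    """Answer a /lookup/<qname>/<qtype> request with matching rows, possibly none."""
    rows = [
        r for r in RECORDS
        if r["qname"].lower() == qname.lower()
        and (qtype == "ANY" or r["qtype"] == qtype)
    ]
    # No match is still a successful lookup: return an empty list, not false.
    return json.dumps({"result": rows})
```

Here `lookup("nope.example.org.", "ANY")` returns `{"result": []}`, leaving `{"result": false}` for genuine backend failures.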

tyingq · on April 30, 2021

Unbound is pretty popular with the folks I work with, but in this case, they are only serving up things they are authoritative for, so they don't need much.

wut42 · on April 30, 2021

Isn't Unbound a resolver, and not an authoritative server (like NSD)?

tyingq · on April 30, 2021

Ah, yep, right. Argh, too late to edit/delete.