
This post is pretty misleading.

PyPI still fully supports mirrors (though it is becoming increasingly hard to run a full mirror of PyPI; last I looked, a full copy of PyPI was about 30TB).

The only thing we ever removed was designating any particular mirror as official and an auto discovery protocol that was quite frankly extremely insecure and slow. That worked by giving every single mirror that wanted to be an "official" mirror for auto discovery a subdomain of `pypi.python.org`, labeled {a-z}.pypi.python.org. A client would determine what mirrors were available by querying last.pypi.python.org, which was a CNAME pointing to the last letter that we had assigned. That told the client how many mirrors there were, and it could then work backwards from that letter. So if the CNAME pointed to c.pypi.python.org, the client would know that a, b, and c existed.
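To make the scheme concrete, here is a rough sketch of the client-side step, assuming the CNAME target of last.pypi.python.org has already been resolved (the DNS lookup itself is left out). The function name `mirrors_from_last` is made up for illustration; this is not the code any real client used.

```python
import string

BASE_DOMAIN = "pypi.python.org"


def mirrors_from_last(cname_target: str, base: str = BASE_DOMAIN) -> list:
    """List every mirror hostname implied by the CNAME target of last.<base>.

    Hypothetical helper: the caller is assumed to have already resolved the
    CNAME and passes in its target, e.g. "c.pypi.python.org".
    """
    target = cname_target.rstrip(".").lower()
    label, _, parent = target.partition(".")
    # The scheme only allowed a single letter a-z directly under the base domain.
    if parent != base or len(label) != 1 or label not in string.ascii_lowercase:
        raise ValueError(f"unexpected CNAME target: {cname_target!r}")
    last_index = string.ascii_lowercase.index(label)
    # Work backwards from the last assigned letter: every earlier letter must exist.
    return [f"{letter}.{base}" for letter in string.ascii_lowercase[: last_index + 1]]


if __name__ == "__main__":
    print(mirrors_from_last("c.pypi.python.org"))
```

Note that the list can never hold more than 26 entries and can never skip a letter, which is where the scaling and removal limits of the scheme come from.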

Immediately you should be able to see a few problems with this:

- It is grossly insecure. Subdomains of a domain can set cookies on the parent domain, and depending on ~things~ they can also read cookies.

- It does not scale past having 26 mirrors.

- It does not support removing a mirror; there can be no gaps in the letters.
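For the cookie problem specifically, here is a simplified sketch of the domain-matching rule browsers apply (modeled on RFC 6265 section 5.1.3). It is an illustration only: it ignores the public suffix list and other checks real browsers perform, and the function name is my own.

```python
import ipaddress


def domain_match(host: str, cookie_domain: str) -> bool:
    """Simplified RFC 6265 domain-match: may `host` use a cookie scoped to `cookie_domain`?"""
    host = host.lower().rstrip(".")
    cookie_domain = cookie_domain.lower().lstrip(".")
    if host == cookie_domain:
        return True
    try:
        # IP addresses only ever match exactly, never by suffix.
        ipaddress.ip_address(host)
        return False
    except ValueError:
        pass
    # Otherwise the host must be a subdomain of the cookie's domain.
    return host.endswith("." + cookie_domain)


if __name__ == "__main__":
    # A mirror operator's subdomain is allowed to set a cookie scoped to the parent.
    print(domain_match("g.pypi.python.org", "pypi.python.org"))
    # The parent site cannot scope a cookie down to one mirror's subdomain.
    print(domain_match("pypi.python.org", "g.pypi.python.org"))
```

The first check passing is the whole issue: whoever runs g.pypi.python.org can plant a cookie that the browser will then send to pypi.python.org itself.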

So we needed to remove that auto discovery mechanism, which raised the question of what, if anything, we should replace it with.

Well, at the time we had only ever made it up to g.pypi.python.org, so there were only 7 mirrors in total that ever asked to become an official mirror. To my knowledge we never reused a letter; if a mirror went away we would just point its subdomain back at the main PyPI instance. I don't remember exactly, but my email references there being only 4 mirrors left.

From my memory at the time, most of those 4 mirrors were regularly hours or days behind PyPI, would regularly go offline, etc.

But again, we never stopped anyone from running a mirror; we just removed the auto discovery mechanism and required them to get their own domain name. We even linked to a third party site that would index all of the servers and keep track of how "fresh" they were, and other stats (at least until that site went away).

Running a mirror of PyPI is a nontrivial undertaking, and most people simply don't want to do that. We never had many mirrors of PyPI running, and as it turns out, once we improved PyPI most people decided they simply didn't care to use a mirror and preferred to just use PyPI. Still, to this day we support anyone who wants to mirror us.




I misremembered the PyPI mirror system (pre-Fastly) as being more similar to Debian's[0]; I didn't realize it had so many problems.

Debian managed to solve all of the concerns you listed. What makes PyPI unique?

[0]: https://www.debian.org/mirror/list


So there's a few things here:

Firstly, Debian's mirror network URLs allow a mirror operator to attack the base debian.org site if that site relies on cookies on debian.org (it may not, I'm not sure). Specifically the `ftp.<country>.debian.org` aliases cause this. On PyPI we did use cookies at the base URL, so this was a non-starter for us to keep.

The second thing here is that, at a technical level, Debian and PyPI are generally similar in how mirrors are configured and hosted. Meaning, other than the above aliases, mirrors are expected to have their own domain and users are expected to configure apt or pip to point to a specific domain. Debian does have a command that will attempt to do that configuration for you, to make it easier.
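On the pip side, "configure pip to point to a specific domain" usually means setting `index-url` in the `[global]` section of pip's config file. A minimal sketch that generates such a file body; the mirror URL is a placeholder and the helper name is invented for this example:

```python
import configparser
import io


def pip_conf_for_mirror(index_url: str) -> str:
    """Return pip config file contents that point pip at the given index URL."""
    # interpolation=None so a literal "%" in a URL is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser["global"] = {"index-url": index_url}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # mirror.example.org is a hypothetical mirror, not a real one.
    print(pip_conf_for_mirror("https://mirror.example.org/simple/"))
```

The same thing can be done per invocation with `pip install --index-url <url> <package>`.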

The third thing is that Debian's mirrors are as secure as the main repository is against attacks from a compromised mirror operator. This isn't the case in PyPI, where you're forced to trust the mirror operator to serve you the correct packages. There is vestigial support for a scheme to address this in the mirroring PEP, but nothing ever really implemented it except the very old version of PyPI (none of the clients, etc). That scheme is also very insecure, so it doesn't really provide the security levels it was intended to.

The fourth thing is that a Debian mirror is easier to operate.

Packages on Debian don't live forever: as new versions are released old versions get removed, and as OS releases move into end of life, entire chunks of packages get rotated out. However, on PyPI we don't have the concept of an OS release, or any sort of phasing out of old packages. All packages are valid for as long as the author makes them available. This means that the storage space to run a PyPI mirror (currently ~30TB) is a lot more than the storage space for a Debian mirror (~4TB).

On top of that, the ways apt and pip function are inherently different. Apt has users occasionally download the metadata for the entire package set so that apt has a local copy, while pip asks the server for the metadata of each package (it does some light caching, but not a lot). This means that to discover what packages are available, apt might make one request a day while pip might make 100 requests for every invocation of pip. Packages on apt also release a lot slower and less often than on PyPI, so many times people may not need to download more than a handful of packages from a Debian mirror, but people generally need to download a lot of packages from PyPI at a time.

I believe (not certain) the Debian mirroring protocol is rsync based, which is generally pretty reliable, while the PyPI mirroring protocol is a custom one which works, but has a tendency to get "stuck" every few months and requires operators to notice and fix it themselves.

I suspect the difference in strength between the two mirror networks is some combination of the above, but I suspect the third and fourth things are the biggest differences, particularly since PyPI's CDN solved, in most users' minds, the problem that would cause them to want to host or use a mirror.



