More

donaldstufft · on May 27, 2023

There is zero chance we require biometrics on PyPI.

donaldstufft · on May 27, 2023

You can have multiple accounts associated with a single project, so each person can have their own account and you just add them all as owners to the projects.

That can be annoying to keep in sync if you have a lot of projects, but we're rolling out organization support to make that easier for people.

donaldstufft · on May 27, 2023

That's great!

While we strive to make PyPI useful for everyone we totally understand that sometimes the trade offs we have to make just don't work for everyone so we try really hard to enable folks like yourself to be able to set up their own repositories. I'm glad that it's working out for you and that you've got a setup you like.

I do want to mention two things:

We've got a PEP (PEP 708) going through the works that will tighten the security model around multiple repositories down some more. If I understand your uses well enough you should be able to add a line or two of HTML to your repository and not have any interruptions or warnings. That PEP isn't accepted yet or implemented or anything, but something to keep in the back of your mind at least.

While we don't make any sort of raw download logs available, we do publish what is essentially a query-able database of download events that have been parsed already to make it easy to see those stats. We do have a little bit of redaction on those events, primarily to avoid leaking PII like IP addresses and such, where instead of an IP address we log broad geographical area (country I think?).

If anyone is curious to see that, it's hosted in Google BigQuery (sorry, it does require a Google account) and there's a guide at https://packaging.python.org/en/latest/guides/analyzing-pypi... that tells you more about it.

dalke · on May 27, 2023

Well, speak of the devil and he doth appear! :)

I looked at PEP 708. I was confused by what "repository" means. In PEP 503 "A repository that implements the simple API is defined by its base URL .... Within a repository, the root URL (/ for this PEP which represents the base URL) MUST be a valid HTML5 page with a single anchor element per project in the repository.".

A repository contains projects - "individual project contained within a repository".

PEP 708 seems to use "repository" to mean both that and individual project. Consider "To enable one repository to extend another, this PEP allows the extending repository to declare that it “tracks” another repository by adding the URL of the repository that it is extending."

The examples show project with new entries tracking a project on another repository.

This made it hard for me to understand what something like "repository owner" really means.

> If I understand your uses well enough you should be able to add a line or two of HTML to your repository and not have any interruptions or warnings.

I'm not sure it works, at least, not fully.

I tell people to use:

  python -m pip install chemfp -i https://chemfp.com/packages/

This used to contain only one project, "chemfp".

Now it also contains "click" and "tqdm" entries, copied verbatim from the respective PyPI project entries, because I recently added my first required install dependencies, and -i doesn't automatically fall back to PyPI.

I use '-i' because I don't want installs to start using the old chemfp version on PyPI. (Why? I only distribute pre-compiled wheels for 'manylinux'. I didn't release Python 3.11 wheels until a few weeks ago. I don't want pip for 3.11 users, or on macOS, to find the source version and try to install them. And "We’ve spent 15+ years educating users that the ordering of repositories being specified is not meaningful, and they effectively have an undefined order." ;)

I would prefer to not maintain copies of the click and tqdm project entries, as I need to remember to refresh them.

I think with PEP 708 I can have a single

    <meta name="pypi:alternate-locations" content="https://pypi.org/simple/click/">
    ...
    <body></body>

which will fix that pain point.

However, the main issue I have is namesquatting is still too easy. If I started chemfp now, with no PyPI entry, then from my side nothing changes.

But I've had people do "pip install chemfp" WITHOUT the -i option then ask what why it didn't work.

I assume that's because people aren't used to using a -i (or configuring it in their requirements.txt) so aren't sensitive as to why it's important.

Namesquatting a purely non-PyPI project then comes easy - register it on PyPI. PyPI is active about namesquatting, but you all surely don't track all small non-PyPI projects.

What I would like is, I thought, pretty simple:

    python -m pip install chemfp.com:/package/chemfp/

with the ability to also specify a path like that in the requirements.

That's considered in PEP 708 as possible ("To my knowledge the only systems that have managed to do this end up piggybacking off of the domain system and refer to packages by URLs with domains etc") but rejected ("our ability to retrofit that into our decades old system is practically zero without burning it all to the ground and starting over" ... "This would upend so many core assumptions ...")

This means while some of my current issues will be assuaged with this PEP, my fundamental concern will not.

donaldstufft · on May 27, 2023

TOTP is a good choice, and helps a lot on keeping everyone's account safe! Thanks a lot for taking the time to investigate your options

I do feel want to mention though (largely because I think it's pretty cool), that those security devices are using a private/public key system under the covers, and they're actually designed to be privacy friendly and phishing resistant. One of the problems with TOTP based 2FA is that since it's asking users to type the TOTP code into the website, they can be phished and tricked into typing their password and TOTP code into an attacker's website, who then quickly go and use it to sign onto their account.

Those hardware tokens prevent that phishing from happening. They basically create, on the fly, a public/private key pair that is bound to the domain name of the site in question, and then give the public key of that to the site. When you come back to log in again, the site tells the hardware token what public key it has, the token looks a the site's domain and determines if it has that key for that domain, and if it does it uses a signature to prove ownership of the private key.

It all ends up working really well, since the domain name (actually the protocol, domain name, and port) is part of the identify of the key pair, it is impossible for it to get entered on the wrong site, so it completely eliminates phishing. Then since every single site gets it's own brand new keypair generated for it, there's no way to determine that the hardware token used on Site A is the same as the hardware token used on Site B. So it's entirely privacy preserving as well!

The protocol is obviously a bit more complicated then that, but that's the general idea of it.

donaldstufft · on May 25, 2023

One important thing to remember here is that PyPI was originally started in 2002 as a weekend hack project that grew overtime to become the piece of critical infrastructure it is today. There's a lot of stuff in PyPI that exists as historical baggage and cruft and reviewing them just never bubbled up to be a priority. Likewise a lot of the policies it has have been added and grown overtime as something happened that caused us to need one.

On top of all of that, it's volunteer run and has been understaffed for basically it's entire life, so sitting down and figuring out a proper data retention policy that takes a holistic view of everything we have just never bubbled up.

In general I think we already do a pretty good job of collecting a minimal amount of data, and hopefully with proper policies we can do an even better job.

donaldstufft · on May 25, 2023

> Before you can submit packages to Debian you have to get an existing Debian developer to sign your PGP key. In Debian the trust flows downward from older developers to newer developers.

This is not how signing works in Debian at a technical level. At at technical level uploading to Debian requires them to add your key to a list of keys maintained by the archive administrators. As a matter of policy those administrators ask you to get your key signed by an existing Debian Developer, but at no point does their upload infrastructure check that or use the Web of Trust.

upofadown · on May 25, 2023

That list of keys maintained by the archive administrators are signed by debian developers. That is how the archive admins can be sure that the key is in some sense legit. Otherwise where would be the root of trust?

donaldstufft · on May 25, 2023

The root of trust for uploads is the listed of signatures maintained by the archive administrators, flat out.

The requirement for having individual keys signed by Debian Developers just makes it easier for the archive administrators to decipher which keys they want to add to their root of trust. The upload system does not check those signatures at all, they do not need to exist in the slightest as far as the upload system is concerned.

mistrial9 · on May 25, 2023

this seems motivated ulterior to the topic, or making a mountain out of a small hill for other reasons. The act of approval is done approximately manually at first, with automation supporting that decision over time. Perfect machines are in short-supply, so to this day there is some manual aspect to this, which is faulted with a tone that is dire ... doesn't add up based on my understanding of this

donaldstufft · on May 25, 2023

This post is pretty misleading.

PyPI still fully supports mirrors (though it is becoming increasingly hard to run a full mirror of PyPI, last I looked a full copy of PyPI is about 30TB).

The only thing we ever removed was designating any particular mirror as official and an auto discovery protocol that was quite frankly extremely insecure and slow. That worked by giving every single mirror that wanted to be an "official" mirror for auto discovery a subdomain of `pypi.python.org`, labeled {a-z}.pypi.python.org. A client would determine what mirrors were available by querying last.pypi.python.org, which was a CNAME pointing to the last letter that we had assigned, that would tell it how many mirrors there were, then they could work backwards from that letter. So if the CNAME pointed to c.pypi.python.org, the client would know that a, b, and c existed.

Immediately you should be able to see a few problems with this:

- It is grossly insecure. Subdomains of a domain can set cookies on the parent domain, depending on ~things~ they can also read cookies.

- It does not scale past having 26 mirrors.

- It does not support removing a mirror, there can be no gaps in the letters.

So we needed to remove that auto discovery mechanism, which raised the question of what, if anything, we should replace it with?

Well at the time we had only ever made it up to g.pypi.python.org. So there was only 7 total mirrors that ever asked to become an official mirror. To my knowledge we never reused a letter, if a mirror went away we would just point the mirror back at the main PyPI instance. I don't remember exactly, but my email references there being only 4 mirrors left.

From my memory at the time, most of those 4 mirrors were regularly hours or days behind PyPI, would regularly go offline, etc.

But again, we never stopped anyone from running a mirror, we just removed the auto discovery mechanism and required them get their own domain name. We even linked to a third party site that would index all of the servers and keep track of how "fresh" they were, and other stats (at least until that site went away).

Running a mirror of PyPI is a non trivial undertaking, and most people simply don't want to do that. We never had many mirrors of PyPI running, and as it turns out once we improved PyPI most people decided they simply didn't care to use a mirror and preferred to just use PyPI, but still to this day we support anyone to mirror us.

dpifke · on May 25, 2023

I misrembered the PyPI mirror system (pre-Fastly) being more similiar to Debian[0], I didn't realize it had so many problems.

Debian managed to solve all of the concerns you listed, what makes PyPI unique?

[0]: https://www.debian.org/mirror/list

donaldstufft · on May 25, 2023

So there's a few things here:

Firstly, Debian's mirror network URLs allow a mirror operator to attack the base Debian.org site if they rely on cookies on debian.org (they may not, I'm not sure). Specifically the `ftp.<country>.debian.org` aliases cause this. On PyPI we did use cookies at the base url, so this was a non starter for us to keep.

The second thing here is that Debian and PyPI from a technical level about how mirrors are configured and hosted are generally similar. Meaning other than the above aliases, mirrors are expected to have their own domain and users are expected to configure apt or pip to point to a specific domain. Debian does have a command that will attempt to do that configuration for you to, to make it easier.

The third thing is that Debian's mirrors are as secure as the main repository is against attacks from a compromised mirror operator. This isn't the case in PyPI where you're forced to trust the mirror operator to serve you the correct packages. There is vestigal support for a scheme to support this in the mirroring PEP, but nothing ever really implemented it except the very old version of PyPI (none of the clients, etc). That scheme is also very insecure, so it doesn't really provide the security levels it was intended to.

The fourth thing is that a Debian mirror is easier to operate.

Packages on Debian don't live forever, as new versions are released old versions get removed, and as OS releases move into end of life, entire chunks of packages get rotated out. However on PyPI we don't have the concept of an OS release, or any sort of phasing out of old packages. All packages are valid for as long as the author makes them available. This means that the storage space to run a PyPI mirror (currently ~30TB) is a lot more than the storage space for a Debian mirror (~4TB).

On top of that the way apt and pip function are inherently different. Apt has users occasionally download the entire package set so that apt has a local copy of the metadata while pip asks the server for each package for the metadata (it does some light caching, but not a lot). This means that to discover what packages are available, apt might make one request a day while pip might make 100 requests for every invocation of pip. Packages on apt release a lot slower and less often than on pip. so many times people may not be needing to download more than a handful of packages, but people generally need to download a lot of packages from PyPI at a time.

I believe? the Debian mirroring protocol is rsync based, which is generally pretty reliable, while the PyPI mirroring protocol is a custom one which works, but it sometimes has a tendency to get "stuck" every few months and require operators to notice and fix themselves.

I suspect the differences between the strength of the mirror network is some combination of the two, but I suspect the the third and fourth things are the biggest differences, particularly when PyPI's CDN solved the problem in most users minds that would cause them to want to host or use a mirror.

donaldstufft · on May 25, 2023

I would resign from PyPI before I ever allowed a backdoor to be installed.

I haven't explicitly asked, but I would be very surprised if any of the other PyPI admins felt differently.

wraptile · on May 25, 2023

PyPI is clearly a passion project for the team and Python community in general so I can't imagine that anyone would allow this or die on this hill to save their salary.

I've tried to dig around whether there's any history or potential of government stopping company from ceasing operation/resigning and honestly nothing came up that wasn't ww2 related. So, I think it's pretty safe to rule out PyPI from doing anything like this.

dpifke · on May 25, 2023

My comment was not meant to imply that PyPI admins would be OK with this, but the sad situation in the U.S. (and Australia, and other places) is that they'd probably face jail time if they refused to comply. You can't avoid complying with a court order by saying, "sorry, I quit." (And even if "sorry, I quit" was a valid response, you'd be facing tens of thousands of dollars in legal fees to justify it, with a gag order in place that meant you couldn't raise a legal defense fund.)

If you're looking for examples of what the NSL process is like, Nicholas Merrill's story[0] comes to mind.

Further, the fact that admins have this power—even if they'd never use it—makes them an attractive target for black hats. If backdooring packages was easier to detect, it'd be a less attractive option for those that might want to do so.

I'm still hopeful that they'll re-implement some sort of end-to-end signing mechanism, sooner rather than later. I trust PyPI and the people behind it, but I'd like to be able to verify.

[0]: https://en.wikipedia.org/wiki/Nicholas_Merrill

donaldstufft · on May 25, 2023

Well, AFAIK it's not clear that in the US the courts have the right to compel someone to modify their software in that way. The FBI holds that it does, but so far it's been fought and they've given up when they've tried it. I think if such a thing were to happen, the fundamental ability to secure any software goes out the window. Even package signing, etc go out the window because they can just compel you to produce new software, signed with your existing key.

But let's step back a moment and presume that they do have that ability to compel. The first step here is that none of the PyPI Administrators are the legal owners of PyPI, so such an order would not be sent to any of us, but rather to the PSF itself. The PSF would then be on the hook to either comply or fight said hypothetical order, but individual members of the administration team would not be, and would be free to quit. They may not be able to say why they've quit, but quitting AFAIK would be entirely possible.

The PSF, while not having Apple's war chest, does retain counsel for dealing with things like this, and I can say personally I'd spend myself broke before I'd be willing to do so.

We are going to be implementing signing, and I'm hoping we'll be able to make strong progress on that soon.

donaldstufft · on May 23, 2023

Sometimes? There's no global policy of doing it in Debian, it's up to individual package maintainers inside of Debian to enable it (it defaults to off AFAIK) and to hardcode the key that they expect the package to be signed by.

In the cases that it is used, AFAIK it is only used by Debian's uscan program, which is sort of like the Debian version of Dependabot, it tells them when there is a new version of something to package. As far as I know, the process of packaging that new version is still manual, and relies on the maintainer downloading the package and packaging it, so they may or may not use the signature in that case.

How useful this is, is up for debate. Many years ago when I first started taking over releasing pip, that caused the pip GPG key to change, and the reaction of the Debian maintainer at the time was to just comment out the signature bit and fall back to no signature.

donaldstufft · on May 23, 2023

I don't believe that Maven Central's use of GPG is providing a meaningful security control here, so I would dispute the idea that they're doing it "right".

jpgvm · on May 23, 2023

At the very least there are a) more active keys b) those keys are available on keyservers and c) it's being used by the major packages in the ecosystem correctly. i.e Spring, Jackson, Quarkus, Logback, Apache-sphere, Google-sphere, etc.

So while it might not be providing meaningful security for lower-tier packages it's definitely doing it's job for top tier packages like these that are relied on by hundreds of thousands of projects.