I talk about this more explicitly in my PyCon talk (https://pythonspeed.com/pycon2025/slides/ - video coming soon). It's not specifically about Pydantic, but basically:
1. Inefficient parser implementation.
It's just... very easy to allocate way too much memory if you don't think about large documents, and very difficult to measure. It's a common problem with many (but not all) JSON parsers.
2. CPython's in-memory representation is large compared to compiled languages.
So e.g. a 4-digit integer is 5-6 bytes in JSON, 8 bytes in Rust if you use an i64, and 25ish bytes in CPython. An empty dictionary is 64 bytes. (You can check these numbers yourself; see the sketch below.)
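For anyone who wants to verify these numbers, here's a quick sketch using only the standard library; `data.json` is a placeholder, and exact sizes vary by CPython version and platform:

```python
import json
import sys
import tracemalloc

# Shallow per-object sizes on 64-bit CPython (exact values vary by version):
print(sys.getsizeof(1234))  # ~28 bytes for a small int
print(sys.getsizeof({}))    # 64 bytes for an empty dict

# Peak memory while parsing; note tracemalloc only sees allocations made
# through CPython's allocators, which covers the stdlib json module.
tracemalloc.start()
with open("data.json") as f:  # placeholder file
    doc = json.load(f)
retained, peak = tracemalloc.get_traced_memory()
print(f"retained: {retained:,} bytes, peak during parse: {peak:,} bytes")
```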
Often it's the legacy of an engineer (or team) who "did what they had to do" to meet a deadline and then, even if they wanted to migrate to something better post-launch, wasn't allowed to allocate time to go back and do so.
At least JSON or CSV is better than the ad hoc homegrown formats you'd find at medium-sized companies that came out of the '90s and '00s.
Once you switch to ijson, no, that won't save any more memory, because ijson uses essentially zero memory for the parsing itself. You're just left with the cost of the in-memory representation.
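As a concrete sketch of the streaming approach (assuming the file is a large top-level JSON array; the filename and the per-record handler are placeholders):

```python
import ijson

def handle(record):
    """Hypothetical per-record processing."""
    print(record)

# Stream records one at a time instead of materializing the whole
# document with json.load(). The "item" prefix means "each element
# of the top-level array".
with open("data.json", "rb") as f:  # placeholder file
    for record in ijson.items(f, "item"):
        # Only one record is alive at a time; if you append them all
        # to a list, you're back to paying for the full representation.
        handle(record)
```

The parsing stays cheap; memory only grows if you keep the parsed objects around.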
I rarely need to dynamically add attributes to dataclasses like this myself, but unfortunately it also means things like `@cached_property` won't work, because there's nowhere to internally cache the method's result.
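A minimal sketch of the failure mode, assuming a dataclass with `slots=True`:

```python
from dataclasses import dataclass
from functools import cached_property

@dataclass(slots=True)
class Point:
    x: float
    y: float

    @cached_property
    def magnitude(self) -> float:
        return (self.x ** 2 + self.y ** 2) ** 0.5

p = Point(3.0, 4.0)
# Raises TypeError: cached_property stores the computed value in the
# instance's __dict__, and __slots__ classes don't have one.
p.magnitude
```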
One key question in this sort of scheme is how free() works: it's given a pointer, and it has to decide whether that allocation was sampled or not, with _minimal_ effort.
Poireau does this, IIRC, by putting the allocations it sampled in a distinct memory address range.
Sciagraph (https://sciagraph.com), a profiler I created, uses allocation size. If an allocation is chosen for sampling, the profiler makes sure its size is at least 16KiB. free() then assumes that any allocation of 16KiB or larger was sampled. That assumption may be wrong, i.e. a false positive, but it means free() doesn't have to do anything beyond calling malloc_usable_size(), which matters when you're freeing lots and lots of small allocations. A previous iteration used alignment as a heuristic, so that's another option.
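Here's a rough Python simulation of that size heuristic. This is my sketch of the idea, not Sciagraph's actual implementation; a real version hooks malloc()/free() in C or Rust:

```python
SAMPLE_THRESHOLD = 16 * 1024  # 16KiB

# Toy stand-ins for the allocator's bookkeeping and the profiler's
# records; a real implementation intercepts the C allocator.
_usable_size = {}
_sampled = {}
_next_addr = 0

def traced_malloc(size: int, sample: bool) -> int:
    """Allocate, padding sampled allocations up to the threshold."""
    global _next_addr
    if sample:
        # Pad the allocation so free() can later recognize it as
        # possibly-sampled just by looking at its size.
        size = max(size, SAMPLE_THRESHOLD)
    _next_addr += 1
    _usable_size[_next_addr] = size
    if sample:
        _sampled[_next_addr] = size
    return _next_addr

def traced_free(addr: int) -> None:
    """Cheap check: only allocations >= 16KiB can have been sampled."""
    if _usable_size[addr] >= SAMPLE_THRESHOLD:
        # May be a false positive (a genuinely large allocation that
        # wasn't sampled), but the extra lookup only happens for big
        # allocations, so the common small-allocation path stays fast.
        _sampled.pop(addr, None)
    del _usable_size[addr]
```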
It's somewhat domain specific. Pure Python libraries have an easier time supporting new releases than libraries that rely on C APIs, and slower still are those that deal with less stable implementation details like bytecode (e.g. Numba). But it's definitely getting better and faster with every release.
There's overhead in transferring data from CPU to GPU and back. I'm not sure how this works with integrated GPUs, though, given that they share RAM with the CPU.
In general, as I understand it (I'm not a GPU programmer), you want to transfer data to the GPU, have it do a lot of operations, and only then transfer it back. Doing a single tiny operation isn't worth the round trip.
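A hedged illustration of the batching idea using CuPy (assumes an NVIDIA GPU and the cupy package installed; I'm just showing the transfer/compute tradeoff, not a real workload):

```python
import numpy as np
import cupy as cp  # requires a CUDA GPU

data = np.random.rand(10_000_000)

# Wasteful: a full host->device->host round trip for one tiny operation.
out = cp.asnumpy(cp.asarray(data) + 1.0)

# Better: one transfer in, a batch of work on the GPU, one transfer out.
gpu = cp.asarray(data)           # host -> device copy
gpu = cp.sqrt(gpu * 2.0 + 1.0)   # all of this runs on the GPU
gpu = cp.log1p(gpu)
out = cp.asnumpy(gpu)            # device -> host copy
```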