Fundamentally it is more efficient to process a batch of tokens from multiple users/requests on a server than to process tokens from a single user's request on device.
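A back-of-envelope sketch of why: at small batch sizes a decode step is memory-bound on streaming the weights, so FLOPs per byte read grows roughly linearly with batch size until you become compute-bound. The dimensions and byte counts below are illustrative assumptions, not numbers from this thread.

```python
# Toy model: one (B x d) @ (d x d) matmul per decode step, fp16 weights.
# At small B, bytes read are dominated by the weight matrix, so
# arithmetic intensity (FLOPs per byte) scales ~linearly with batch size.
def arithmetic_intensity(d=4096, batch=1, bytes_per_param=2):
    flops = 2 * d * d * batch             # multiply-accumulate count
    bytes_read = d * d * bytes_per_param  # weights streamed once per step
    return flops / bytes_read

for b in (1, 8, 64):
    print(f"batch={b:3d}  FLOPs/byte={arithmetic_intensity(batch=b):.1f}")
```

Serving one user at batch 1 pays the full weight-streaming cost for a tiny amount of compute; batching many users' tokens amortizes that same memory traffic.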
Maybe on a particular model/dataset, but extremely unlikely in general. Again, as another commenter pointed out: if you truly believe it isn't that hard, we would love to hire you at Meta ;)
Yes, some operations are not supported on MPS/TPU and fall back to the slower CPU. But for common architectures like transformers and convnets, it works very well regardless of dataset.
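In practice the pattern is just "use the accelerator if it's there, otherwise CPU". A minimal sketch, assuming PyTorch (guarded so it degrades to CPU when torch isn't installed); `pick_device` is a hypothetical helper name:

```python
import importlib.util

def pick_device():
    # Fall back to CPU when PyTorch is not available at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch
    # Prefer Apple's MPS backend, then CUDA, then CPU.
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

print(pick_device())
```

Note that even on MPS, individual unsupported ops can still fall back to CPU (PyTorch gates this behind the `PYTORCH_ENABLE_MPS_FALLBACK=1` environment variable), which is where the slowdown the parent mentions comes from.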
I never claimed it was easy. I meant that, in my opinion, it is on the order of tens of millions of dollars of investment, not the trillion-dollar CUDA moat that people here describe.
Our group works on some of this stuff at Meta, and we have a pretty good diversity of backgrounds - high performance computing (the bulk), computer systems, compilers, ML engineers, etc. We are hiring.
lmkd (low memory killer daemon) works fairly differently off of a different set of signals and different policy. But yes, conceptually they try to achieve the same goal.
I also do not know whether Android combines system libraries into one big file for the savings, something Apple devices do (the dyld shared cache).
Same, doing it once out of my own curiosity to see how the corporate machine works.
Not doing it again - having seen first hand that it is due to managerial incompetence more than anything else. The risk/"reward" ratio is just not worth it: if I pull something off, managers will claim it's due to their "processes" and "leadership"; if I don't pull it off, managers won't promote me.
No win, so... just don't take fake "deadlines" too seriously.
Umm, from what I have seen in big tech, "impact" is also fairly arbitrary. It is all based on how cozy one is with one's manager, skip-level manager, and so on. More accurate would be "perception of impact".
Especially as it gets more and more nebulous at higher levels.
> A page will be loaded in if any part of it is useful. Given that functions will be laid out more or less randomly throughout a shared library, and programs use a randomly scattered subset of the functions, I think it's safe to say that you'll get a lot of bytes read into RAM that are never used.
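A toy simulation makes the point concrete: with random layout, even a program that calls only 10% of a library's functions ends up touching most of its pages, because nearly every page contains at least one used function. All sizes below are illustrative assumptions.

```python
import random

PAGE = 4096  # typical page size in bytes

def touched_fraction(n_funcs=10_000, func_size=300, use_frac=0.10, seed=0):
    """Fraction of a shared library's pages paged in, assuming functions
    are laid out in random order and the program uses a random subset."""
    rng = random.Random(seed)
    order = list(range(n_funcs))
    rng.shuffle(order)  # random layout of functions in the library
    used = set(rng.sample(range(n_funcs), int(n_funcs * use_frac)))
    pages = set()
    for slot, f in enumerate(order):
        if f in used:
            start = slot * func_size
            end = start + func_size - 1
            pages.update(range(start // PAGE, end // PAGE + 1))
    total_pages = (n_funcs * func_size + PAGE - 1) // PAGE
    return len(pages) / total_pages

print(f"pages touched: {touched_fraction():.0%}")
```

With ~13 functions per 4 KiB page, the chance a page holds no used function at all is small, so roughly three quarters of the library's bytes get read in while only a tenth of its code is ever executed.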
It is possible to create an implementation based on the above that decodes 24-dimensional points to the closest Leech lattice vector in under 1 microsecond per point on my AMD Ryzen laptop. Combine that with some fast random projections (Johnson-Lindenstrauss) as described in the article to form the LSH.
This LSH family is unfortunately not present in FALCONN, but the alternatives in FALCONN are pretty good.
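The overall shape of such an LSH family is simple even without a real Leech decoder. A minimal sketch of the recipe, assuming a Gaussian random projection down to 24 dimensions and, purely as a stand-in for the Leech-lattice decoder, rounding each coordinate to the nearest integer (i.e. the Z^24 lattice); `make_hash` and all parameters are illustrative:

```python
import math
import random

def make_hash(in_dim, out_dim=24, scale=1.0, seed=0):
    """Build one LSH function: random projection + lattice quantization.
    Rounding to Z^24 stands in for a real Leech-lattice decoder, which
    would give much better quantization in 24 dimensions."""
    rng = random.Random(seed)
    # Gaussian projection matrix (Johnson-Lindenstrauss style).
    proj = [[rng.gauss(0, 1) / math.sqrt(out_dim) for _ in range(in_dim)]
            for _ in range(out_dim)]

    def h(x):
        y = [sum(p * xi for p, xi in zip(row, x)) / scale for row in proj]
        return tuple(round(v) for v in y)  # nearest lattice point (stand-in)

    return h

h = make_hash(in_dim=8)
print(h([0.01] * 8))  # nearby inputs tend to land on the same lattice point
```

The `scale` parameter trades off bucket width against collision probability, exactly as with ordinary p-stable LSH; swapping the rounding step for a fast Leech decoder is what makes the 24-dimensional version attractive.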
Not really true anymore, unfortunately - the system is converging with that of other large organizations.
For example, to keep compensation in line with the market and reduce attrition, there were definitely more people promoted to ICT5 last year, and the designation has thus been diluted.
Furthermore, candidates coming in from other large companies expect the title/leveling prestige in many cases, and ICT5 is a tough sell while trying to hire a Google L7/8. So Apple does have a fair number of ICT6 from that.
I also think limiting IC layers and keeping the above politics minimal can work - Netflix did that for a long time and was very successful as a company with that approach. Most ICs at Netflix were simply titled "senior software engineer", with pay being a wide band dependent on market value. They no longer do this, for whatever reason, and have adopted a standard multi-level hierarchy.