Fundamentally it is more efficient to process a batch of tokens from multiple users/requests on a server than to process tokens from a single user's request on device.
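A back-of-envelope sketch of why: at small batch sizes a decode step is memory-bound on streaming the weights, so FLOPs per byte read grows roughly linearly with batch size until you become compute-bound. The dimensions and byte counts below are illustrative assumptions, not numbers from this thread.

```python
# Toy model: one (B x d) @ (d x d) matmul per decode step, fp16 weights.
# At small B, bytes read are dominated by the weight matrix, so
# arithmetic intensity (FLOPs per byte) scales ~linearly with batch size.
def arithmetic_intensity(d=4096, batch=1, bytes_per_param=2):
    flops = 2 * d * d * batch             # multiply-accumulate count
    bytes_read = d * d * bytes_per_param  # weights streamed once per step
    return flops / bytes_read

for b in (1, 8, 64):
    print(f"batch={b:3d}  FLOPs/byte={arithmetic_intensity(batch=b):.1f}")
```

Serving one user at batch 1 pays the full weight-streaming cost for a tiny amount of compute; batching many users' tokens amortizes that same memory traffic.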
Maybe on a particular model/dataset, but extremely unlikely in general. Again, as another commenter pointed out: if you truly believe it isn't that hard, we would love to hire you at Meta ;)
Yes, some operations are not supported on MPS/TPU and fall back to the slower CPU. But for common architectures like transformers and convnets, it works very well regardless of dataset.
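In practice the pattern is just "use the accelerator if it's there, otherwise CPU". A minimal sketch, assuming PyTorch (guarded so it degrades to CPU when torch isn't installed); `pick_device` is a hypothetical helper name:

```python
import importlib.util

def pick_device():
    # Fall back to CPU when PyTorch is not available at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch
    # Prefer Apple's MPS backend, then CUDA, then CPU.
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

print(pick_device())
```

Note that even on MPS, individual unsupported ops can still fall back to CPU (PyTorch gates this behind the `PYTORCH_ENABLE_MPS_FALLBACK=1` environment variable), which is where the slowdown the parent mentions comes from.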
I never claimed it was easy. I meant that, in my opinion, it is on the order of tens of millions of dollars of investment, not the trillion-dollar CUDA moat that people here describe.
Our group works on some of this stuff at Meta, and we have a pretty good diversity of backgrounds - high performance computing (the bulk), computer systems, compilers, ML engineers, etc. We are hiring.
lmkd (low memory killer daemon) works fairly differently off of a different set of signals and different policy. But yes, conceptually they try to achieve the same goal.
I also do not know whether Android combines system libraries into one big file for the savings, something Apple devices do (the dyld shared cache).
Same, doing it once out of my own curiosity to see how the corporate machine works.
Not doing it again - having seen first hand that it is due to managerial incompetence more than anything else. The risk/"reward" ratio is just not worth it: if I pull something off, managers will claim it's due to their "processes" and "leadership"; if I don't pull it off, managers won't promote me.
No win, so... just don't take fake "deadlines" too seriously.
Umm, from what I have seen in big tech, "impact" is also fairly arbitrary. It is all based on how cozy one is with one's manager, skip-level manager, and so on. More accurate would be "perception of impact".
Especially as it gets more and more nebulous at higher levels.
> A page will be loaded in if any part of it is useful. Given that functions will be laid out more or less randomly throughout a shared library, and programs use a randomly scattered subset of the functions, I think it's safe to say that you'll get a lot of bytes read into RAM that are never used.
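A toy simulation makes the point concrete: with random layout, even a program that calls only 10% of a library's functions ends up touching most of its pages, because nearly every page contains at least one used function. All sizes below are illustrative assumptions.

```python
import random

PAGE = 4096  # typical page size in bytes

def touched_fraction(n_funcs=10_000, func_size=300, use_frac=0.10, seed=0):
    """Fraction of a shared library's pages paged in, assuming functions
    are laid out in random order and the program uses a random subset."""
    rng = random.Random(seed)
    order = list(range(n_funcs))
    rng.shuffle(order)  # random layout of functions in the library
    used = set(rng.sample(range(n_funcs), int(n_funcs * use_frac)))
    pages = set()
    for slot, f in enumerate(order):
        if f in used:
            start = slot * func_size
            end = start + func_size - 1
            pages.update(range(start // PAGE, end // PAGE + 1))
    total_pages = (n_funcs * func_size + PAGE - 1) // PAGE
    return len(pages) / total_pages

print(f"pages touched: {touched_fraction():.0%}")
```

With ~13 functions per 4 KiB page, the chance a page holds no used function at all is small, so roughly three quarters of the library's bytes get read in while only a tenth of its code is ever executed.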
It is possible to create an implementation based on the above that decodes 24-dimensional points to the closest Leech lattice vector in under 1 microsecond per point on my AMD Ryzen laptop. Combine that with some fast random projections (Johnson-Lindenstrauss) as described in the article to form the LSH.
This LSH family is unfortunately not present in FALCONN, but the alternatives in FALCONN are pretty good.
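The overall shape of such an LSH family is simple even without a real Leech decoder. A minimal sketch of the recipe, assuming a Gaussian random projection down to 24 dimensions and, purely as a stand-in for the Leech-lattice decoder, rounding each coordinate to the nearest integer (i.e. the Z^24 lattice); `make_hash` and all parameters are illustrative:

```python
import math
import random

def make_hash(in_dim, out_dim=24, scale=1.0, seed=0):
    """Build one LSH function: random projection + lattice quantization.
    Rounding to Z^24 stands in for a real Leech-lattice decoder, which
    would give much better quantization in 24 dimensions."""
    rng = random.Random(seed)
    # Gaussian projection matrix (Johnson-Lindenstrauss style).
    proj = [[rng.gauss(0, 1) / math.sqrt(out_dim) for _ in range(in_dim)]
            for _ in range(out_dim)]

    def h(x):
        y = [sum(p * xi for p, xi in zip(row, x)) / scale for row in proj]
        return tuple(round(v) for v in y)  # nearest lattice point (stand-in)

    return h

h = make_hash(in_dim=8)
print(h([0.01] * 8))  # nearby inputs tend to land on the same lattice point
```

The `scale` parameter trades off bucket width against collision probability, exactly as with ordinary p-stable LSH; swapping the rounding step for a fast Leech decoder is what makes the 24-dimensional version attractive.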
Not really true anymore, unfortunately - the system is converging with that of other large organizations.
For example, to keep compensation in line with the market and reduce attrition, there were definitely more people promoted to ICT5 last year, and the designation has thus been diluted.
Furthermore, candidates coming in from other large companies expect the title/leveling prestige in many cases, and ICT5 is a tough sell while trying to hire a Google L7/8. So Apple does have a fair number of ICT6 from that.
I also think limiting IC layers and keeping the above politics minimal can work - Netflix did that for a long time and was very successful as a company with that approach. Most ICs at Netflix were simply titled "senior software engineer", with pay being a wide band dependent on market value. They no longer do this, for whatever reason, and have adopted a standard multi-level hierarchy.