Hacker News | 152334H's comments

VP of AI at FAIR, who is unrelated to the llama/genai team.


Is it free-priority based?

I was told by an employee that GDM internally has a credits system for TPU allocation, with which researchers have to budget out their compute usage. I may have completely misunderstood what they were describing, though.


You are correct that true H100 ownership costs are far lower. As I mention in the H100 blurb, those numbers are flexible, and I don't mind if you halve them.

MFU can certainly be improved beyond 40%, as I mention. But on the point of small models specifically: the paper uses FSDP for all models, and I believe a rigorous experiment should not vary the sharding strategy, since that introduces numerical differences. FSDP2 on small models will be slow even with compilation.

The paper does not tie embeddings, as stated. The readout layer does lead to 6DV, because it is a linear layer with D*V parameters, which costs 2*D*V FLOPs in the forward pass and 4*D*V in the backward pass. I would appreciate it if you could limit your comments to factual errors in the post.


My bad on the 6DV estimate; you are correct that if they do dense decoding (rather than hierarchical decoding, as Google used to do in the old days), the cost is exactly 6DV. I cannot edit the GP comment and I will absorb the shame of my careless words there. I was put off by the subtitle and initial title of this HN post, though the current title is more appropriate and correct.

Even if it's a small model, one could use DDP or FSDP/FSDP2 without slowdowns on a fast interconnect, which certainly adds to the cost. But if you want to reproduce all the work at the cheapest price point, you only need to parallelize to the minimal level needed to fit in memory (or rather, the level that maximizes MFU), so everything below 2B parameters runs on a single H100 or a single node.
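
Concretely, the choice looks something like this in PyTorch (a rough sketch, not the paper's actual setup; the ~2B cutoff is just the ballpark above, and it assumes the distributed process group is already initialized):

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import ShardingStrategy

    def wrap(model: torch.nn.Module, fits_on_one_gpu: bool) -> torch.nn.Module:
        model = model.cuda()
        if fits_on_one_gpu:
            # Small models (roughly <2B params here): replicate and just sync grads.
            return DDP(model)
        # Otherwise shard params / grads / optimizer state across the node.
        return FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)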


I think the commenter was thinking about the input embedding layer, where the model fetches an input token's embedding by index, which is a constant-time lookup.

And the blog post author is talking about the output layer, where the model has to produce a prediction for every possible token in the vocabulary. Each prediction is a dot product between the transformer hidden state (of dimension D) and a token embedding (also of dimension D, whether or not it is shared with the input), computed for all V tokens in the vocabulary. That's where the D*V comes from.

It would be great to clarify this in the blog post to make it more accessible, but I understand that there is a tradeoff.
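
Roughly, in code (hypothetical D and V, just to show where the D*V term and the 6DV cost estimate come from):

    import torch

    D, V = 4096, 32000                  # hidden size, vocab size (made-up values)
    embed = torch.nn.Embedding(V, D)
    readout = torch.nn.Linear(D, V, bias=False)

    token_id = torch.tensor([42])
    x = embed(token_id)        # input side: an index lookup, cost independent of V
    logits = readout(x)        # output side: one length-D dot product per vocab entry

    # Per token, the readout matmul is ~2*D*V FLOPs forward and ~4*D*V backward
    # (gradients w.r.t. both weights and activations), i.e. ~6*D*V in training.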


thanks



just set up a desktop service to randomly open a paper once every few hours

if they're not too boring, and you're not doing anything important, you'll read it for fun
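
something like this is enough (assumes a folder of downloaded PDFs at ~/papers and a Linux desktop with xdg-open; macOS would use `open` instead):

    import random
    import subprocess
    import time
    from pathlib import Path

    PAPERS = Path.home() / "papers"   # wherever your downloaded PDFs live
    EVERY_HOURS = 3

    while True:
        pdfs = list(PAPERS.glob("*.pdf"))
        if pdfs:
            subprocess.run(["xdg-open", str(random.choice(pdfs))])
        time.sleep(EVERY_HOURS * 3600)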


I agree, and really hope that Meta is doing something in that vein. Reducing the FLOPs:Memory ratio (as in Soft MoE) could also open the door to CPU (or at least Apple Silicon) inference becoming more relevant.


Thanks. I'm really no expert (:P) on MoE research; I just noticed what was written in the Soft MoE paper and felt a need to check.

The non-deterministic outputs are really similar, yeah, if you check the gist examples I linked: https://gist.github.com/152334H/047827ad3740627f4d37826c867a.... This part at least is no surprise, since the randomness should be bounded.

I suspect OpenAI will figure out some way to reduce the randomness at some point, though, given their public commitment to eventually adding logprobs back to ChatCompletions.


I don't think this commitment was ever plausible. Token "probabilities" only have a straightforward probabilistic interpretation for base models. In fine-tuned models, they no longer represent the probability of the next token given the prompt, but rather how well the next token fulfills the ... tendencies induced by SL and RL tuning. Which is presumably pretty useless information. OpenAI has no intention of providing access to the GPT-4 base model, and they in fact removed API access to the GPT-3.5 base model.


Topic laundering: the probabilities are the probabilities; you don't suddenly get wrong probabilities with more training on more data.


You do, because it's not just more training, it's PPO updates instead of MLE. It's no longer trying to estimate the token distribution of the training corpus; it's trying to shift logprobs toward tokens that maximize expected reward from the RM. The GPT-4 technical report has a figure showing that logprobs become less well calibrated as confidence scores in the RLHF model compared to the pre-trained model.
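
Schematically (standard forms, not necessarily OpenAI's exact recipe; r_phi is the reward model, beta the KL coefficient, pi_ref the pre-RL reference policy):

    \text{pre-training (MLE):}\quad \max_\theta\ \mathbb{E}_{x \sim \text{data}} \sum_t \log p_\theta(x_t \mid x_{<t})

    \text{RLHF (PPO):}\quad \max_\theta\ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x,y)\big] - \beta\,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)

The first estimates the corpus token distribution; the second only keeps the policy near the reference via the KL term, so the resulting logprobs are no longer calibrated next-token probabilities.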


Fascinating, ty


Strange. There seems to be no mention of vast.ai in the GPU Cloud Providers list, but they're definitely cheaper than all of the options listed.

Perhaps the writer left them out in an attempt to reduce demand?


Two users I talked with mentioned bad experiences with them. It's not that it's always bad: they said it can be good, and I know the pricing is often great, but they ran into unreliable instances. So I don't want to recommend it to most people.


> I think it's more plausible that teaching focuses on this surface knowledge because it's much easier and more legible, and looks and feels very much like "programming education" to someone who does not have actual domain knowledge (because other subjects are usually done in the same way), or who isn't thinking very much about it, and then similar problems and a notion that testing should be "fair" and "cover what students have learned" lead to insufficiently outcome-oriented exams, which then sets up incentives biasing students in similar directions.

And that's how you end up with CS students acing theory exams while completely flunking the coding exam...

> Computers are not at all human, in that they do exactly what someone has set them up to do, which is often not what they thought they were doing, while many beginners expect them to "understand what they meant" and act accordingly. Every simple-looking capability is burdened with detail: the computer "knows what time it is" (thanks to some nontrivial engineering with some possible failure points); the out-of-order CPU "runs just like an abstract in-order machine, but very fast" (until security researchers find a difference); DNS "resolves domain names to IPs" (but is frequently intercepted by networks, and can also serve as a covert backchannel); video codecs "make videos smaller" (but are also complex domain-specific programming languages); text rendering "is just copying bitmaps into the right places" (unless you care about Unicode or antialiasing or kerning).

Although people don't really encounter complex leaky abstractions like that as beginners in coding. More likely, they'll encounter some simpler poor abstraction, like Scratch blocks not all being fully composable in an intuitive way, or a sheer wall of complexity (e.g. being taught in Java/C for an introductory course).

