
I always find evals of this flavor off-putting, given that 3.5 and 4 likely share preference models (or at least feedback data).


Actually, the only numbers every LLM developer should know are their accelerator specs. For example:

A100 specs:

- 312e12 BF16 FLOPS

- 1555e9 B/s HBM bandwidth (~1555 GB/s)

H100:

- 1000e12/2000e12 BF16/INT8 FLOPS

(apply a ~0.7 FLOPS efficiency multiplier, because H100s power-throttle very quickly)

- 3000 GB/s HBM bandwidth

---

For a 13B model on an A100, this nets:

13e9 * 2 bytes per param = 26 GB HBM required (at bf16)

26e9/1555e9 = 17ms / token small-batch latency (~60 tokens / second)

What about large batches?

compute latency for a batch of size B is 13e9 params * 2 FLOPs per param * B / 312e12 FLOPS

We want B such that we're just about no longer HBM bound: 26e9/312e12 * B = 17ms

<=> B = 17e-3 / (26e9/312e12)

giving a batch size of 204.

At that batch size (and all larger batch sizes), the A100 delivers a throughput of B / 17ms = 12000 tokens / second.

---

KV caching, multi-GPU and multi-node comms, and matmul efficiencies are left as an exercise to the reader :)
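
For anyone who wants to poke at the numbers, here's a rough Python sketch of the arithmetic above, using only the specs and the 2-bytes / 2-FLOPs-per-param assumptions from this comment (it prints ~201 instead of 204 because it doesn't round 16.7 ms up to 17 ms):

    # Napkin math: 13B-param bf16 model on a single A100-40GB, rough peak specs.
    PARAMS = 13e9
    BYTES_PER_PARAM = 2          # bf16
    FLOPS_PER_PARAM = 2          # one multiply-add per param per token
    A100_FLOPS = 312e12          # dense BF16
    A100_HBM_BPS = 1555e9        # bytes/s (~1555 GB/s)

    weights = PARAMS * BYTES_PER_PARAM                 # 26 GB of HBM just for weights
    mem_latency = weights / A100_HBM_BPS               # ~17 ms/token while bandwidth-bound
    flop_time = PARAMS * FLOPS_PER_PARAM / A100_FLOPS  # compute time per token per batch element
    crit_batch = mem_latency / flop_time               # batch where compute catches up with HBM
    throughput = crit_batch / mem_latency              # tokens/s at and above that batch size

    print(f"{weights / 1e9:.0f} GB, {mem_latency * 1e3:.1f} ms/token, "
          f"critical batch ~{crit_batch:.0f}, {throughput:.0f} tokens/s")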


I’d suggest revisiting voice recognition; it works quite well for me in the same use case.

I also like to take walks, sometimes listening to podcasts. The stock iOS voice recognition (the microphone button on the keyboard, not Siri) is quite good; I usually just talk into the phone without looking at the output. After the walk, I format and clean up the notes to fix any errors.


My experience of voice to text is that it's painfully nondeterministic; amazing when it works, which is... just short of enough for me to trust it.


That's fair - I live in a bustling city and get self-conscious about people being able to hear me text and such. I know it's really effective these days, though.


I would have been interested to hear more about the verification techniques and tools they used for this project.


Check out https://cacm.acm.org/magazines/2015/4/184701-how-amazon-web-... ("How Amazon Web Services Uses Formal Methods") and https://d1.awsstatic.com/Security/pdfs/One_Click_Formal_Meth... ("One-Click Formal Methods") for more info.


The folks involved gave a neat talk about the verification techniques and tools they used as part of AWS Pi Week recently: https://www.twitch.tv/videos/951537246?t=1h10m10s


For some large n, integers in the algorithm may be so large that operations on them cease to be constant time.


That is obvious, but how does it change the big O? (Why would any manual implementation have better big-O behavior than the existing arithmetic implementation in CPython?)


I think they meant that a naive implementation of an algorithm may not actually be bounded by the big O the programmer originally wanted, because the code calls other functions whose cost is "hidden" from the programmer.
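
A toy way to see that "hidden" cost in CPython (just an observation, not a precise complexity claim): the time for a single big-int multiplication grows with the operand size, so counting a multiplication as O(1) stops being valid for very large n.

    import time

    # Time 100 squarings of an n-bit integer; the per-multiplication cost
    # clearly grows with n, so big-int arithmetic is not a constant-time step.
    for bits in (10_000, 100_000, 1_000_000):
        x = (1 << bits) - 1
        start = time.perf_counter()
        for _ in range(100):
            _ = x * x
        print(f"{bits:>9} bits: {time.perf_counter() - start:.3f} s")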


This can be done by 'unparsing' the Python AST. We do this in the DaCe project: https://github.com/spcl/dace/blob/900cac070c4b083653c672e225...
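
As a minimal stdlib illustration of the same idea (this uses ast.unparse from Python 3.9+, not DaCe's own unparser):

    import ast

    tree = ast.parse("x = a + b * 2")
    # Toy transformation: rename every name 'a' to 'y', then turn the AST back into source.
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id == "a":
            node.id = "y"
    print(ast.unparse(tree))  # prints: x = y + b * 2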


Loading books with a program like calibre [1] lets you convert EPUB to MOBI (the Kindle format) seamlessly before transferring. In my experience this works perfectly; a command-line sketch follows below the link.

[1] https://calibre-ebook.com/
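
If you'd rather script the conversion than use the GUI, calibre also ships a command-line converter, ebook-convert; a minimal Python sketch (the filenames are placeholders, and calibre must already be installed):

    import subprocess

    # ebook-convert is the CLI converter bundled with calibre.
    subprocess.run(["ebook-convert", "mybook.epub", "mybook.mobi"], check=True)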


It should be noted that the graph-embedding tasks described are only a small subset of the tasks that GNNs solve. Many (if not most) graph learning techniques focus on more "local" tasks like node classification or edge prediction.


His prerequisite first-year course [0] actually matches your description better, and even has students designing their own CPUs on FPGAs. I feel like this course goes more in the direction of his own research interests, and already assumes the basic "breadth" you speak of.

[0] https://safari.ethz.ch/digitaltechnik/spring2019/doku.php?id...


Many of the CompArch topics are covered in his first-year course: https://safari.ethz.ch/digitaltechnik/spring2019/doku.php?id...


