
Such a fantastic paper overall: it was a pleasure to read, it's very accessible, and it's greatly informative. If you are new to the idea and seeking a definition of AGI, reading this paper is easy and is immeasurably superior to merely googling or reading the Wikipedia article.

My only criticism of the paper, within the particular set of goals outlined above, is this:

The paper seems to under-explore two aspects that deserve explicit and detailed treatment:

1. The ability to rapidly learn post-deployment from a very limited amount of instructional data, and to substantially advance its abilities in the domain being learned, as opposed to possessing a certain level of professional skill immediately at deployment.

2. The ability to invent entirely new ideas on its own - for instance, an entirely new number system or some other symbolic system - in order to advance its current goals.

Both matter, in part to distinguish an AGI from a large collection of glued-together narrow AIs, each purpose-built for a specific (if entire) domain of fairly loosely related tasks, and in part to ensure that a high-level AGI system always appears at least as intelligent as an average human teenager across the full spectrum of possible cognitive and metacognitive interactions with said teenager (whether those interactions are initiated by another human or by the cognitive projection of the environment).

Without these abilities, it could be argued, there could be a system that technically (or at least arguably) satisfies the paper's definition of an ASI level of AGI, yet next to which an average human child or teenager may appear more intelligent, exceeding the said system in plasticity and in real-time, limited-input adaptability of intellect rather than in off-the-shelf proficiency at trained adult human tasks. A high-level AGI system might be initially trained on trillions of tokens of input data, but once deployed it needs to be able to acquire new skills and proficiencies from mere tens to thousands of input examples, the way humans do.

Perhaps the framework presented by the paper was intended to silently encompass these abilities and the remarks above, but surely they deserved a separate discussion, just as other aspects of the definitions and the framework proposed by the paper are indeed explicitly discussed.

Similarly, not including "autonomy" among the "six principles" (making them seven) for composing a definition of AGI, and discussing it only briefly and as an aside, also appears to be a questionable choice, for the same reasons.


https://arxiv.org/abs/2310.02989 ("xVal: A Continuous Number Encoding for Large Language Models", Submitted on 4 Oct 2023)


It conceptually never admits ignorance and never asks for clarification. It always produces something, to the best of its ability. This _seems_ like a minor technical limitation (there are plenty of traditional ML systems, in image recognition in particular, that have produced a confidence percentage alongside the answer for years if not decades), but most likely it's actually a very hard problem, as otherwise OpenAI would have mitigated it somehow by now, given that they clearly agree it is a serious problem [2] (more generally formulated as reliability [1]).
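For reference, here's a minimal Python sketch (not tied to any system mentioned here) of the kind of confidence-alongside-the-answer output that traditional classifiers expose - just a plain softmax over made-up logits:

    import math

    def softmax(logits):
        # Turn raw scores into a probability distribution.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    # Hypothetical raw scores from an image classifier for three labels.
    labels = ["cat", "dog", "fox"]
    probs = softmax([2.1, 0.3, -1.0])
    best = max(range(len(labels)), key=lambda i: probs[i])
    print(labels[best], f"{probs[best]:.0%}")  # the answer plus a confidence %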

[1] https://www.youtube.com/watch?v=GI4Tpi48DlA&t=1342s (22:22, "Highlights of the Fireside Chat with Ilya Sutskever & Jensen Huang: AI Today & Vision of the Future", recorded March 2023, published May 16, 2023)

[2] https://www.youtube.com/watch?v=GI4Tpi48DlA&t=1400s (23:20, ditto)


Just tried and it completely fell over: https://chat.openai.com/share/ea9cf93f-0632-48b9-aa47-1bbf31...

Tried it on Bing in "Precise" mode as well, and it fell over just the same, but starting with C instead of A.


Gpt4: To solve this problem, we need to ensure that neither the ambassador of Costaguana is left alone with the ambassador of Buranda, nor is the ambassador of Buranda left alone with the ambassador of Atlantis. Here's a step-by-step strategy:

1. *First Trip:* The general takes the ambassador of Buranda across first. This prevents any initial conflict.

2. *Return Trip:* The general returns alone to the bunker, leaving the ambassador of Buranda on the other side.

3. *Second Trip:* The general then takes the ambassador of Atlantis.

4. *Return Trip with Buranda:* The general brings the ambassador of Buranda back to the bunker. This is crucial because leaving the ambassador of Atlantis and the ambassador of Costaguana alone would not cause any conflict.

5. *Third Trip with Costaguana:* The general then takes the ambassador of Costaguana across the tunnel.

6. *Final Return Trip:* The general returns alone to the bunker for the last time.

7. *Final Trip with Buranda:* Finally, the general takes the ambassador of Buranda across.

This sequence ensures that at no point are the ambassador of Costaguana and the ambassador of Buranda left alone together, nor are the ambassador of Buranda and the ambassador of Atlantis. Thus, the relationships between the nations remain unescalated.


Thank you for providing this.

Bing Chat runs on GPT-4, however [1]. And Bing gets this wrong in all 3 of its modes (Creative, Balanced, and Precise) as of the time of writing.

Given this experiment and similar ones presented around here, it stands to reason that GPTs(**1) often identify(**2) the problem as a "wolf, goat, and cabbage" problem and then merely guess which node is the middle node (the inner node of the "danger to" graph), yielding a 1/3 chance of getting it right by pure luck and resulting in the diverse reports here.

(**2) That does not always yield an adequate response beyond the mere permutation of nodes, however. I've been getting the following variants for step 1 from Bing in Precise mode in response to slightly different rewordings of the same prompt:

- The general escorts the ambassador of Costaguana through the tunnel first. This leaves the ambassador of Atlantis and the ambassador of Buranda in the bunker, but they are not alone because the general is still there.

- The general escorts the ambassador of Costaguana through the tunnel first. This leaves the ambassador of Atlantis and the ambassador of Buranda in the bunker, but they are not alone because they have each other.

and so on.

(**1) I also tried Bard and Llama 2, with even more disastrous results, full of nonsense of the (**2) kind. The ChatGPT-3.5 response posted earlier is also rife with these.

Re

> By the way, as soon as these systems are able to check their reasoning (i don't think it'll be a huge leap) it's enough to solve reasoning problems with probability >0.1% for example. Because you can just have it do rollouts in its head until it's correct [2]

Mistakes of type (**2) don't seem to fit the target of the cyclic refinement you are proposing, as far as I can understand it. These errors aren't getting the logic wrong; they completely butcher the basic relationships between the actors, like what it means to be alone, or the spatial relationships between the actors and their environment.
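To make the 1/3 figure concrete, here is a small Python sketch (my own reconstruction of the puzzle as quoted above, not anything from the thread) that checks the classic 7-step plan for each of the three possible "middle node" guesses:

    # Danger pairs as stated in the puzzle: Costaguana-Buranda and Buranda-Atlantis.
    DANGER = [{"Costaguana", "Buranda"}, {"Buranda", "Atlantis"}]
    ALL = {"Atlantis", "Buranda", "Costaguana"}

    def safe(unattended):
        # A side without the general is safe iff no danger pair is left there.
        return not any(pair <= unattended for pair in DANGER)

    def classic_plan_is_safe(middle):
        # Classic plan: take `middle`, return, take `a`, bring `middle` back,
        # take `b`, return, take `middle`. Listed below: the side left without
        # the general after each of the 7 trips.
        a, b = sorted(ALL - {middle})
        unattended = [{a, b}, {middle}, {b}, {a}, {middle}, {a, b}, set()]
        return all(safe(side) for side in unattended)

    for guess in sorted(ALL):
        print(guess, classic_plan_is_safe(guess))
    # Only Buranda (the true inner node of the "danger to" graph) works,
    # so a blind guess succeeds 1 time in 3.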

[1] https://blogs.bing.com/search/march_2023/Confirmed-the-new-B...

[2] https://news.ycombinator.com/item?id=38389222




I'm afraid I'm with your colleagues.

I value being able to grasp what the code does at a glance while scrolling through. Your first snippet enters my brain as if without reading at all, all at once and instantly.

Your second example absolutely does not and I'm sure you know why.


I doubt anyone will be able to answer questions this fine-grained - not now (perhaps later, if the implementation is architecturally exposed, i.e. leaks into the ISA, and the Asahi Linux group figures some of it out), and possibly not ever (if it's architecturally transparent and thus entirely micro-architectural).

> Does this mean there, effectively, are no registers?

I can only point out, just for context, that if by any chance you're asking whether the registers are implemented as actual hardware-design "registers" - individually routed and individually accessible small strings of flip-flops or D-latches - then the history of the question is actually "it never was registers in the first place": architectural (ISA) registers in GPUs are implemented by a chunk of addressable, ported SRAM, with an address bus, a data bus, a limited number of simultaneous accesses, and limited bandwidth [1].

[1] see the diagram at https://www.renesas.com/us/en/products/memory-logic/multi-po...
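If it helps, here is a toy Python model of that idea - purely illustrative, not any real GPU's design - where "r7" is just address 7 in a small, port-limited SRAM rather than its own string of flip-flops:

    class PortedRegisterFile:
        def __init__(self, num_regs=256, read_ports=2):
            self.sram = [0] * num_regs   # one addressable array, not N separate wires
            self.read_ports = read_ports

        def read(self, addresses):
            # With more requests than ports, a real design needs extra cycles;
            # here we just report how many cycles the access would take.
            cycles = -(-len(addresses) // self.read_ports)  # ceiling division
            return [self.sram[a] for a in addresses], cycles

    rf = PortedRegisterFile()
    rf.sram[7] = 42                      # "write r7"
    values, cycles = rf.read([7, 8, 9])  # 3 reads through 2 ports
    print(values, cycles)                # [42, 0, 0] 2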


Oh! Well, that explains that then. Wild!


There is a fairly informative survey on the subject: https://www.osti.gov/servlets/purl/1332070 (A Survey of Techniques for Architecting and Managing GPU Register File)


An easier to read research article that's narrower in subject and seemingly more relevant to the OP: https://research.nvidia.com/sites/default/files/pubs/2012-12... ("Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor", 2012)


"128 * the-number-of-cores" of threads can make progress truly in parallel (at the same time).

24,576 threads (or however many - I didn't validate the number, and it depends on occupancy, which depends on thread resource usage like registers, which in turn depends on the shader program code) is how many threads can be executed concurrently (as opposed to in parallel), as in, how many of them can simultaneously reside on the GPU. At any given time a subset of those are actually executed in parallel; the rest are idle.
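To put rough numbers on the parallel-vs-concurrent distinction (a Python back-of-the-envelope using the figures from this thread; the core count and occupancy are made up):

    num_cores      = 8                        # hypothetical GPU core count
    lanes_per_core = 128                      # "128 * the-number-of-cores"
    parallel = num_cores * lanes_per_core     # threads advancing this very cycle

    max_resident = 24_576                     # threads that can live on the GPU at once
    occupancy    = 0.5                        # drops as each thread uses more registers
    resident     = int(max_resident * occupancy)

    print(f"truly in parallel: {parallel}, resident (concurrent): {resident}, "
          f"idle at this instant: {resident - parallel}")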

You can think of this situation as follows using an analogy with a CPU and an OS:

1. 128 * the-number-of-cores is the number of CPU cores(*1)

2. 24,576 threads is the number of threads in the system that the OS is switching between

Major differences with the GPU:

3. On a CPU, a context switch (getting a thread off the core, waking up a different thread, restoring its context, and proceeding) takes about 2,000 cycles. On the GPU _from the analogy_, that kind of thread switch takes ~1-10 cycles, depending on the exact GPU design and various other details.

4. In the CPU/OS world, context switching and scheduling on the OS side are done mostly in software, as the OS is indeed software. In the GPU's case, the scheduler and all the switching are implemented as fixed-function hardware finely permeating the GPU design.

5. In the CPU/OS world, those 2,000 cycles per context switch are so much larger than a round trip to DRAM on a load instruction that missed in all caches - about 400-800 cycles or so, depending on the design - that the OS never switches threads to hide load latencies; it's pointless. As far as performance is concerned (as opposed to maintaining the illusion of parallel execution of all programs on the computer), thread switching is used to hide the latency of IO - non-volatile storage access, network access, user input, etc. - which takes millions of cycles or more, so it makes sense.

In the GPU world, the switching is so fast that the hardware scheduler absolutely does switch from thread to thread to hide load latencies (even for loads that hit partway down the cache hierarchy, if that happens). In fact, hiding these latencies and thus keeping the ALUs fed is the whole point of this basic design of pretty much every programmable GPU there ever was (a rough arithmetic sketch of this trade-off follows after point 6).

6. In the real-world CPU/OS, the state of threads that aren't running at the moment (their local variables, etc.) resides in the memory hierarchy; technically some of it ends up in caches, but ultimately the bulk of it on a loaded system is in system DRAM. On a GPU - or I suppose by now we have to say, on a traditional GPU - this resident thread state (local variables, etc.) resides in on-chip SRAM that is part of the GPU cores (not even in one chunk off to the side, but close to the execution units, in many small chunks, one per core). While the amount of DRAM (CPU/OS) is a) huge, gigabytes, and b) easily configurable, the amount of thread state the GPU scheduler shuffles around is typically measured in hundreds of KBs per GPU core (so on the order of "a few MBs" per GPU), and the equally sized SRAM storing this state is completely hardwired into the silicon design of the GPU and not configurable at all.
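Here is the back-of-the-envelope sketch of points 3 and 5 promised above (Python; the cycle counts are the rough figures quoted in those points, not measurements of any particular chip):

    cpu_context_switch = 2_000      # cycles (point 3)
    gpu_warp_switch    = 5          # cycles, mid-point of the ~1-10 range (point 3)
    dram_miss          = 600        # cycles, mid-point of the 400-800 range (point 5)
    io_wait            = 5_000_000  # cycles, "millions or more" (point 5)

    # Switching only pays off when it costs less than the stall it hides.
    print("CPU/OS switches to hide a DRAM miss?", cpu_context_switch < dram_miss)  # False
    print("CPU/OS switches to hide IO?",          cpu_context_switch < io_wait)    # True
    print("GPU switches to hide a DRAM miss?",    gpu_warp_switch    < dram_miss)  # True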

Hope that helps!

Footnote (*1): a better analogy would be not "number of CPU cores" but "number-of-CPU-cores * SMT(HT) * number-of-lanes-in-AVX-registers", where number-of-lanes-in-AVX-registers is basically "AVX-register-width / 32" for FP32 processing, which yields about ~8, give or take 2x, depending on the processor model. Whether to include the SMT(HT) multiplier (2) in this analogy is also murky; there is an argument to be made for yes and an argument for no, and it depends on the exact GPU design in question.


So, in NVIDIA parlance, my Skylake laptop would have 128 "cuda cores"?

128 = 4 (physical cores) * 2 (hyperthreading) * 8 (AVX2 f32 lanes) * 2 (floating point ports per core)


Sorta, yeah!

Also, your "128 cuda cores" of Skylake variety run at higher frequencies and work off of much bigger caches, so they are faster (in serial manner)...

...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time...

...until they are faster again when the shader program uses a lot of registers and GPU occupancy drops to the floor and latency hiding stops hiding that well.

But core counts - yes, more or less.
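If it's useful, here is a toy Python illustration of the "occupancy drops to the floor" case above; the register-file size, warp width, and warp cap are illustrative round numbers, not any specific GPU's spec:

    regfile_per_core = 64 * 1024      # 32-bit registers per GPU core (~256 KB)
    warp_width       = 32             # threads per warp
    max_warps        = 48             # hardware cap on resident warps per core

    def resident_warps(regs_per_thread):
        by_registers = regfile_per_core // (regs_per_thread * warp_width)
        return min(max_warps, by_registers)

    for regs in (32, 64, 128, 255):
        print(regs, "regs/thread ->", resident_warps(regs), "warps resident")
    # 32 -> 48 (capped), 64 -> 32, 128 -> 16, 255 -> 8:
    # fewer resident warps means less latency gets hidden.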


Thank you!

> ...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time...

Is the GPU latency hiding mechanism equivalent to SMT/Hyperthreading, but with more threads per physical core? Or is there more machinery?

Also, how akin are GPUs' "stream multiprocessors"/cores to CPU ones at the microarchitectural level? Are they out-of-order? Do they do register renaming?


As you state, GPU latency hiding is basically equivalent to hyper threading, just with more threads per core. For example, for a 'generic' modern GPU, you might have:

A "Core" (Apple's term) / "Compute Unit" (AMD) / "Streaming Multiprocessor" (Nvidia) / "Core" (CPU world). This is the basic unit that gets replicated to build smaller/larger GPUs/CPUs

* Each "Core/CU/SM" supports 32-64 waves/simdgroups/warps (amd/apple/nvidia termology), or typically 2 threads (cpu terminology for hyperthreading). ie, this is the unit that has a program counter, and is used to find other work to do when one thread is unavailable. (this blurred on later Nvidia parts with Independent Thread Scheduling.)

* The instruction set typically has a 'vector width': 4 for SSE/NEON, 8 for AVX, and typically 32 or 64 for GPUs (but it can range from 4 to 128).

* Each Core/CU/SM can execute N vector instructions per cycle (2-4 is common in both CPUs and GPUs). For example, both Apple and Nvidia GPUs have 32-wide vectors and can execute 4 vectors of FP32 FMA per cycle. So 128 FPUs total, or 128 FMAs (256 FLOPs) per cycle. Each of these FPUs is what Nvidia calls a "Core", which is why their core counts are so high.

In short, the terminology exchange rate is 1 "Apple GPU Core" == 128 "Nvidia GPU Cores", on equivalent GPUs.
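Spelling out that exchange rate with the figures quoted above (a rough sketch, not a spec sheet):

    vector_width  = 32     # FP32 lanes per warp/simdgroup/wave
    vector_issues = 4      # vector FMA instructions issued per cycle per Core/CU/SM

    fpus_per_core   = vector_width * vector_issues   # 128 -> "128 Nvidia CUDA cores"
    flops_per_cycle = fpus_per_core * 2              # each FMA counts as 2 FLOPs
    print(fpus_per_core, flops_per_cycle)            # 128, 256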


I'll leave your first question to the other comment here from frogblast, as I really battled with how to answer it well, given my limited knowledge and being elbow-deep in an analogy, after all. I got writer's block, and frogblast actually answered something :D

> how akin are GPUs' "stream multiprocessors"/cores to CPU ones at the microarchitectural level?

I'd say, if you want to get a feel for it in a manner directly relevant to recent designs, then reading through [1], [2], the subsequent conversation between the two, and the documents they reference should scratch that curiosity itch well enough, from the looks of it.

If you want a much more rigorous treatment, I can recommend the GPU portion of one of the lectures from CMU [3]; it's quite great IMO. It may lack a little focus on the contemporary design decisions that actually ship by the tens of millions in products today, and it strays to alternatives a bit. That's the trade-off.

> Are they out-of-order?

Short answer: no.

GPUs may strive to achieve "out of order" execution by picking a different warp entirely and making progress there, completely circumventing any register data dependencies and thus any need to track them, achieving a similar end objective in a drastically more area- and power-efficient manner than Tomasulo's algorithm would.

> Do they do register renaming?

Short answer: no.

[1] https://forums.macrumors.com/threads/3d-rendering-on-apple-s...

[2] https://forums.macrumors.com/threads/3d-rendering-on-apple-s...

[3] https://www.youtube.com/watch?v=U8K13P6loyk ("Lecture 15. GPUs, VLIW, Execution Models - Carnegie Mellon - Computer Architecture 2015 - Onur Mutlu")


Not quite. These floating-point EUs are shared between both threads of the physical core. I would rather say 64 CUDA cores.


Okay, I went through the other video they reference ("Discover new Metal profiling tools for M3 and A17 Pro" [1]), and there is actually a whole bunch of extra very relevant (IMO) information on the subject, starting about 13:30 or so.

[1] https://developer.apple.com/videos/play/tech-talks/111374?ti...


Judging by the output/GUI of their GPU profiler, "complex" there is more like "complex instructions", think f32 (floating point) ops that aren't additions and multiplications (and FMAs), but trigonometry, square roots, that sort of thing.


To be completely fair, the base 14" MBP is Pro in name only. That laptop has only 2 ports, supports only 1 external monitor, and has a low-core-count CPU and a relatively weak GPU, disqualifying it for any "Pro" work well before RAM (or storage) even enters the discussion.

On the flip side, the very same laptop has a much better display than the competing MacBook Air, and better in a way that is fully observable in non-professional, casual use: visually smoother scrolling of web pages and documents; easier to see in bright sun; a far superior HDR movie-watching experience.

It also has much better laptop speakers - completely irrelevant for professional use (like music production), but very relevant for movie watching.

Base 14" MBP is merely a better Netflix machine than MacBook Air is.

What you said about Pro, and what the article says about Pro, is very fair and in fact quite excellent for, uhm, actual Pro laptops, which are the Mx Pro/Max based machines, not the new base 14" MBP on the (small) Mx chip.

What I do find perplexing is that the 2 default 14" Mx MBP configurations, the ones that require no changes (they often ship faster; being a default is a tangible difference), aren't 8 GB / 512 GB and 16 GB / 512 GB, but 8 GB / 512 GB and 8 GB / 1 TB. That I find odd, because yes, I think a small subset of office users might find a low-core-count but bigger-RAM version useful in practice (I'm thinking very heavy spreadsheet hitters among office power users, managers probably), but virtually no one in that category would actually care to have 1 TB of SSD when everything is web-based and/or stored on the company's servers in a corporate environment. That is indeed odd. But maybe I'm wrong in my assumptions here - it's easy to be, lacking any actual stats or data - and the lineup is appropriate as well. I don't know.

