I see you aren't familiar with the modern state of Minecraft servers. Because Minecraft is essentially limited to a single core, big servers actually aren't a single instance. They use proxy servers (such as BungeeCord and its forks) which distribute load between several lobby servers, and from there people join one of the custom game modes (Skyblock, Bedwars, etc.). This allows tens of thousands of people to play simultaneously, though not in a single world, while SMP (Survival Multiplayer) servers can handle a couple of hundred at most. These giant servers are heavily containerized and automatically scale under load, so spinning up and shutting down servers is a pretty normal thing. And there have been some attempts to make Minecraft run a single world on multiple instances (MultiPaper and some private ones), so even for a typical SMP server this could soon become commonplace as players join and leave.
> And there have been some attempts to make Minecraft run a single world on multiple instances (MultiPaper and some private ones)
First time I've heard of MultiPaper; yet another idea I had that I didn't know someone was already working on, LOL. It's a pretty promising idea considering the current performance problems of the game. This could possibly allow thousands of players on the same server, which would be AMAZING, almost a completely different game. Imagine if MultiPaper were compiled to native using GraalVM.
I do enjoy how reviewers are now using software rendering of Crysis as a benchmark, even if it's just a joke. It would be pretty interesting to see what kind of game engine you could make if it was hyper-optimized around software rendering for modern, massively parallel CPUs.
ZBrush has entered the chat. Not a game engine, but it's been doing that since the first version many years ago, and it always surprises me how it outperforms most GPUs, rendering (and interactively manipulating!) billions of polygons.
Author here - thanks, I'm proud of that test! Figuring it out wasn't as easy as just running the thing, as it crashes on high-end CPUs without the right affinity mask. I've done a couple of videos on it on my YouTube channel. It has two main limitations: 32 threads or 23 cores. That ultimately limits how good Crysis can be on it (you need 23 cores with really fast single-threaded performance), currently around the 17.7 FPS mark at 1080p Low settings.
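(For anyone curious, here's a minimal sketch of applying such an affinity mask from Python with psutil. This is just one way to do it and an assumption on my part, not necessarily the tooling I actually use; the PID is hypothetical.)

    import psutil

    crysis_pid = 12345                      # hypothetical PID of the game process
    proc = psutil.Process(crysis_pid)
    proc.cpu_affinity(list(range(23)))      # pin to logical CPUs 0..22 before worker threads spawn
    print(proc.cpu_affinity())              # confirm the mask took effect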
Probably not a very useful one. AMD's high-core-count parts are certainly impressive, but it's hard to beat >3000 dumb cores that are all doing more or less the same thing anyway.
"CUDA Core" is a misleading marketing term. It merely counts the number of FP32 ALUs. If you measured CPUs this way, you'd have up to 16 "cores" per core. If you count it that way, that makes this CPU is roughly as powerful as the GTX 400 series with it's "512 cores". A brief look at benchmarks seems to confirm that.
Conversely, if you counted a GPU's independent parallel threads of execution like on CPUs, you'd max out at a very comparable 64 "cores".
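To make the counting concrete, here's a rough back-of-the-envelope sketch (the per-core lane count is an assumption for a typical 64-core Zen 2 part, and the GPU figure is just its SM/CU count, nothing measured):

    # Counting FP32 ALUs the way "CUDA cores" are counted:
    cpu_cores = 64
    fp32_lanes_per_core = 16                 # assumed: 2x 256-bit FMA units = 2 * 8 FP32 lanes
    print(cpu_cores * fp32_lanes_per_core)   # 1024 "CUDA cores" worth of FP32 ALUs

    # Counting independent schedulers/front-ends the way CPU cores are counted:
    gpu_sms = 64                             # assumed SM/CU count for a large GPU
    print(gpu_sms)                           # a very CPU-like "core" count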
SIMT is a much more convenient programming model than SIMD for wide compute, though. So much so that I think the marketing around CUDA cores isn't merely reasonable but for many applications actually strikes closer to the truth than counting ALUs or independent threads.
If you count ALUs, you see that the CPU has many of them, but you don't see how difficult it is to chunk up data to keep those fed.
If you count independent threads, you see that the GPU has few of them, but you don't see how it conceptually has many threads, which simplifies the programming of each thread while gracefully degrading only in proportion to how much branching you actually use.
I definitely agree that it's a useful model; after all, that is the way GPUs are presented to programmers. But when getting into detailed performance comparisons, especially with CPUs, that simplified model breaks down quickly and becomes a hindrance. Which is why I think it's unfortunate that NVidia (deliberately) uses the term "core", which leads people to believe they can make a direct comparison to CPUs. Comparisons that, coincidentally, lead people to believe GPUs are much more special than they actually are.
In detailed comparisons all simplified models break down. Tallying max theoretical compute is only appropriate if you're going to put in the effort to actually use it, which is exceedingly rare, even in compute kernels that have supposedly had lots of love and attention already paid to them. So the human factor has to be included in the model and the human factor consistently de-rates SIMD more than it does SIMT.
I realize that under the covers this is more of a compiler/language thing than a compute model thing, but for whatever reason I just don't see much SIMT code targeting CPUs, so again, human factor.
My primary objection is not to the SIMT model, but to NVidia reusing a term with an established meaning ("core") for something completely different and incomparable. Other companies' terms, like "execution units", "compute units" and "stream processors" (though perhaps not so much the last one), are much more truthful about the nature of GPUs without hindering the programming model at all.
From the standpoint of a programmer who doesn't want to suffer through constant DIY chunking and packing (very close to my personal definition of hell), a CUDA core looks a lot like a CPU core. From the point of view of someone not writing the code, a CUDA core looks like merely another FPU.
Yes, "CUDA Core" is (was?) analogous to "FP32 ALU" - (it may have transitioned to FP16 nowadays to double the count?)
Yes, EPYC has 16 FP32 ALU per Core.
But single GPU "Compute Units" often have 2-4x that (up to 64 FP32 ALUs), plus a lot of other differences which, when added up at the whole-GPU level, mean they don't max out anywhere near 64 "cores" "if you counted a GPU's independent parallel threads of execution like on CPUs".
That's where they couldn't be more different!
Also, for accuracy, vis-a-vis "independent parallel threads of execution like on CPUs", even the 64-core Threadripper has 128, not 64.
And many GPUs with 64 physical "Compute Units" would have at least 640 by the same measure (since many have 10 independent programs which can be resident in hardware at once, compared to SMT2 on Threadripper).
But when you break it down further: while a model GPU Compute Unit from a few years ago might have 64 FP32 ALUs in hardware, that CU has the capacity to schedule and execute single instructions on 10 x 64 logical threads (= 640 logically threaded instructions in flight on a single CU at once). On a 64-CU chip, that's 40,960 logical floating point instructions in flight at once, each one coming from a different logical thread of execution in a SIMT model.
The superscalar CPU cores can also have lots of instructions in flight, but they are more deeply pipelining instructions for fewer threads of execution (focused on single-threaded performance).
This is all a long way of saying that a.) you may not have been so far off when comparing raw ALU counts; but, b.) you couldn't be misrepresenting the facts more when comparing the differences in "parallel threads of execution."
This is a core architectural difference between CPUs and GPUs, and while yes, there are similarities in ALU count, the way the transistors and ALUs are utilized by software is quite different.
To oversimplify, and ignoring performance of applying these different programming models to different compute architectures:
Using an example 64-CU GPU compared to 64-Core EPYC:
64 CU GPU as SIMD: 640 Threads (SMT10) and 4096 FP32 ALUs (SIMD64 - actually 4xSIMD16)
64 Core CPU as SIMD: 128 Threads (SMT2) and 1024 FP32 ALUs (SIMD16 - actually 2xSIMD8)
64 CU GPU as SIMT: 40,960 Threads (10:1 Threads:ALUs)
64 Core CPU as SIMT: 2,048 Threads (2:1 Threads:ALUs)
So, this model 64-CU GPU can schedule 20x more logical hardware threads in a SIMT programming model than the 64-Core EPYC CPU, despite having only 4x the FP32 ALUs.
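Tallying the model numbers above in one place (these are illustrative figures, not measurements of any particular chip):

    gpu_cus, waves_per_cu, lanes_per_cu = 64, 10, 64
    cpu_cores, smt, lanes_per_core = 64, 2, 16

    gpu_alus = gpu_cus * lanes_per_cu                          # 4096 FP32 ALUs
    cpu_alus = cpu_cores * lanes_per_core                      # 1024 FP32 ALUs

    gpu_simt_threads = gpu_cus * waves_per_cu * lanes_per_cu   # 40,960 resident SIMT lanes
    cpu_simt_threads = cpu_cores * smt * lanes_per_core        # 2,048

    print(gpu_alus / cpu_alus)                   # 4.0  -> ~4x the FP32 ALUs
    print(gpu_simt_threads / cpu_simt_threads)   # 20.0 -> ~20x the schedulable "threads"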
My final caveat would be that GPUs are over time increasing single-threaded performance, and CPUs are becoming wider. So, in a way, there is some architectural convergence - but it's not to be overstated.
One last note: Niagara was a bit GPU-like, with round-robin thread scheduling, and POWER now has SMT8. But differences remain.
Hey, thanks for the detailed comment. I should note I didn't intend to create the impression that GPU and CPU architectures are completely interchangeable. That's obviously not true, because of the fixed-function hardware alone. The intent was more to give a rough idea of what "CUDA Core" actually means and roughly how those concepts map to what we know from CPUs.
> Yes, "CUDA Core" is (was?) analogous to "FP32 ALU" - (it may have transitioned to FP16 nowadays to double the count?)
It's still FP32 ALUs (NVidia disables FP16 in the driver for gaming cards). The doubling between Turing and Ampere is due to the combined int/fp units being counted as CUDA cores. This also means that for int instructions, the expected performance is not actually increased, another reason the "core" term is unhelpful.
> But single GPU "Compute Units" often have 2-4x that (up to 64 FP32 ALUs)
Yep, guess I should have left that AVX2048 joke in there for you. It just wasn't very important IMO.
> Also, for accuracy, vis-a-vis "independent parallel threads of execution like on CPUs", even the 64-core Threadripper has 128, not 64.
I did not consider SMT here because, while it's relevant for keeping the ALUs fed with instructions, it doesn't increase the amount of raw peak compute possible. Those 128 threads can still only issue 64 AVX instructions at any one time, which is what really matters here.
While not all the new features will be in this release, the rate of improvement for Blender's sculpting tools has been astounding. For example, the new cloth brush:
As someone who has used both PyTorch and TensorFlow for a couple of years now, I can attest to the faster research iteration times for PyTorch. TensorFlow has always felt like it was designed for some mythical researcher who could come up with a complete architecture ahead of time, based on off-the-shelf parts.
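A minimal sketch of what that difference feels like in practice (the module and sizes are made up for illustration): in PyTorch's define-by-run model you can drop a print or a breakpoint straight into the forward pass and call the model like any Python function, instead of committing to a complete graph up front.

    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(128, 64)
            self.fc2 = nn.Linear(64, 10)

        def forward(self, x):
            h = torch.relu(self.fc1(x))
            # ordinary Python in the middle of the model: inspect, branch, whatever
            print("activation mean/std:", h.mean().item(), h.std().item())
            return self.fc2(h)

    net = TinyNet()
    out = net(torch.randn(32, 128))   # a plain function call, no session or graph-compile step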
Indeed, no wonder PyTorch has beaten TensorFlow so thoroughly in the last 3 years, going from 1% of papers to ~50%, while TensorFlow is now down to only 23%:
According to the methodology on that page, the standalone version of Keras (using imports from keras.models, as the Keras docs recommend) would be classified as "Other". (I tried to find the source code to verify this, but couldn't.)
And if that is correct, then I'd be astonished if the vast majority of the "Other" papers aren't Keras. I work in ML and I don't think I've seen a paper that didn't use PyTorch, TensorFlow or Keras in years.
And if that's the case, then almost certainly more papers use TF than PyTorch: PyTorch is at 42%, TF at 23%, and Other at 36%.
(In terms of biases: I hate working in TensorFlow and much prefer PyTorch and Keras. But numbers are numbers.)
Are there any papers that use it for things other than demonstrating Jax? I can't think of one off the top of my head.
Perhaps I should have specified "papers outside those introducing new frameworks, or around speed benchmarking".
There are a bunch of interesting papers using custom libraries for distributed training, and ones targeted at showing off the performance of specific hardware (NVidia has a bunch of interesting work in this space, and Intel and other smaller vendors have done things too).
Keras is pretty good until you hit some custom loss function that needs operations that aren't defined in Keras' backend; then you suddenly have to switch over and write them in TensorFlow, with some ugly consequences (sometimes you don't know which operations will be GPU-accelerated; slicing vectors to compute and aggregate partial loss functions with some complicated math formulas might force computation onto the CPU).
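A hedged sketch of the kind of loss I mean (the formula and the split point are invented for illustration, using the tf.keras flavour of Keras): the first part stays in plain Keras backend ops, the second drops down to raw TensorFlow for the slicing and aggregation, and that's where the op-placement surprises live.

    import tensorflow as tf
    from tensorflow.keras import backend as K

    def partial_mse_loss(y_true, y_pred):
        # plain Keras backend part
        base = K.mean(K.square(y_pred - y_true), axis=-1)

        # raw TensorFlow part: slice out the first few outputs (4 is a made-up
        # split point) and aggregate them with extra weight; whether this part
        # stays on the GPU depends on how TF places the ops
        head = tf.reduce_mean(tf.square(y_pred[:, :4] - y_true[:, :4]), axis=-1)
        return base + 0.5 * head

    # model.compile(optimizer="adam", loss=partial_mse_loss)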
This doesn't have much to do with the algorithms; it's more about the engineering decisions that went into AlphaGo and AlphaZero. They are designed to play one combinatorial game really well. With a bit of additional effort and a lot of additional compute, you could expand the model to account for multiple rule/scale variations, maybe even different combinatorial games.
I think it's quite important to look at the distinction between the actual agent in play and the learning algorithm used.
The learning algorithm AlphaGo uses is somewhat general, and can handle different games (e.g. you can put chess or Go through the algorithm and it functions well for either).
The output of this algorithm, however, is a specialised agent. The agent is not general. If I create a chess agent and give it Go or chess with different rules, it will perform very poorly.
Creating general learning algorithms is arguably a somewhat easier task than creating a general agent, since learning algorithms are typically run for a long time while an agent often has to make time-constrained decisions.
The holy grail of AGI is to make the learning algorithm and the agent the same thing, and have them be general. Then you have an agent which can rapidly adapt to its environment and self-modify as needed. In terms of current research, we are still a long way off from such a system.
The distinction you’re making between agent and algorithm is meaningless for the point I was trying to make, which is that the only connection between this DeepMind research (agent, algorithm, whatever) and AGI have in common is the word “general”.
Their “general learning” tech doesn’t even generalize to barely modified variants of the original games it has claimed to master. I call bullshit.
> Their “general learning” tech doesn’t even generalize to barely modified variants of the original games it has claimed to master. I call bullshit.
But the point I was making is precisely that the "general learning" tech is in fact somewhat general. AlphaGo and certainly AlphaZero's learning tech generalises to Go, chess, and a few other games. That's relatively general in the domain of board games, in my humble opinion.
The reason this isn't close to AGI is because it's not the agent doing the learning, and so while a relatively general learning algorithm produces the agent, the agent itself is not general even in the field of board games.
You appear to be completely missing the point of my root comment, which is that AlphaGo’s tech isn’t nearly as general as it’s made out to be, even if you stick to Go.
> AlphaGo can play very well on a 19x19 board but actually has to be retrained to play on a rectangular board.
It doesn’t even generalize to the same game with a different board shape. Whereas a human Go master could easily do so.
DeepMind is essentially hacking the common usage of the word “general” so that they can make claims about “general” intelligence. And it’s working!
But the training process does generalise. The same training process produces an agent that works on a 19x19 board, a differently shaped board, or even a game of chess.
How is that not general? Sure it doesn't work for all problems but in the domain of board games it definitely feels very general.
The agent the training algorithm produces may not be general, but out of what I've read I've only ever seen DeepMind claim generality of the learning algorithm, not the agent.
I think the GP was noting the problem that an AI can easily encounter situations beyond what it was designed for and simply fail, while human intelligence involves a more robust combination of behaviors and thus humans can generalize in a much wider variety of situations.
If the system designer has to know the parameters of the challenge the system is up against, it should be obvious you can always add another parameter the designer didn't know about and get a situation where the system will fail. This is much more of a problem in "real world situations", which no designer can fully describe.
One thing Microsoft has really nailed is the location diversity of Azure datacenters. They have multiple locations in South Africa and the UAE, while also having regions in multiple cities across Australia, Japan and India. If I wanted to launch a worldwide product, why would I want to go with AWS, Google or IBM?
Many cloud customers have regulatory or business requirements that their data be held within particular regions or national boundaries. It should also help somewhat with latency, though in most cases CDN points of presence will be more critical.
Except those regulatory or business requirements also come with business continuity and disaster recovery requirements from the same regulators, and Azure's region design is far inferior to the alternatives.
For anyone wanting to jump in now, here's a good guide for making your first model in Blender 2.8, even if you have never made a model before: https://www.youtube.com/watch?v=jBqYTgaFDxU
It's just a matter of following the money. A bigger and bigger slice of Microsoft's income is coming from Azure, and an increasing proportion of Azure users are running Linux. This gives them a strong incentive to be the business leaders in open source software.
The other growing slice is services, where they get you to buy stuff from their storefront, or subscribe to their Game Pass or Office 365, or interact directly with Bing and Windows (so they can sell targeted ads). This naturally gives Microsoft an incentive to know everything about you.
Following the money is exactly why I am skeptical. As Linux becomes a revenue stream for Microsoft, it will also become a Microsoft product. Coming from the company that was built on Embrace, Extend, Extinguish, I think that is cause for concern.