
With the advent of GPU coding and the number of people now exposed to it, there is a chance that enough people are willing and able to try an unfamiliar architecture if it gives them an advantage, so this just might be viable now.


What kind of advantage do barrel processors give a regular programmer (especially for things that can't be done efficiently with a CPU/GPU)?


Much closer communication between the two, and thread-level guarantees that would be hard to match otherwise. A barrel processor will 'hit' the threads far more frequently on tasks that would be hard to adapt to a GPU (so, for instance, more branching).

Though CPU/GPU combinations are slowly moving in that direction anyway.

Incidentally, if this sort of thing really interests you: when I was playing around with machine learning, my trick for seeing how efficiently my code was using the GPU was really simple. First I ran a graphics benchmark that maxed out the GPU and measured power consumption. Then I did the same for just the CPU. Afterwards, while running my own code, I'd compare its power draw against the maxima obtained during the benchmarks. That gave a pretty good indication of whether or not I had made some large mistake, and it showed a nice, steady increase with every optimization step.
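For anyone who wants to try the same ratio trick, here is a minimal sketch, assuming an NVIDIA GPU with nvidia-smi on the PATH (the comment doesn't name any tooling, so the helper functions and workflow below are illustrative, not the commenter's actual setup):

    import subprocess
    import time

    def gpu_power_watts() -> float:
        # Sample the GPU's current board power draw (watts) via nvidia-smi.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=power.draw",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        return float(out.stdout.strip().splitlines()[0])

    def average_power(duration_s: float = 10.0, interval_s: float = 0.5) -> float:
        # Average several samples while a workload is running.
        samples = []
        end = time.time() + duration_s
        while time.time() < end:
            samples.append(gpu_power_watts())
            time.sleep(interval_s)
        return sum(samples) / len(samples)

    # Usage, as three separate steps:
    #   1. run a benchmark that maxes out the GPU, record average_power() -> max_w
    #   2. run your own code, record average_power() -> mine_w
    #   3. mine_w / max_w is the utilization proxy; a rising ratio across
    #      optimization steps suggests the GPU is being kept busier.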


Could you not also see an impact on run times? If algorithm A draws 100W and runs in 10 seconds, and algorithm B draws 200W… shouldn't it have to run in very much under 10 seconds to warrant being called a better algorithm?
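Taking the question's numbers at face value as an energy comparison (my arithmetic, not the commenter's):

    # Energy = power x time, using the figures from the question above.
    energy_a = 100 * 10           # algorithm A: 100 W for 10 s -> 1000 J
    breakeven_b = energy_a / 200  # at 200 W, B must finish in under 5 s just to
                                  # match A's energy use, i.e. "very much sub 10 s"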


Not really, it's an apples-to-oranges comparison. If you ran the same distributed algorithm on a single core you wouldn't see the same speed improvements. These chips are ridiculously efficient because they need far fewer gates to accomplish the same purpose, provided you have programmed for them specifically, just like you couldn't simulate a GPU within the same power budget on a CPU. The loss of generality is more than made up for by the increase in efficiency. This is very similar in that respect, with the caveat that a GPU is even more specialized and so even more efficient.

Imagine what kind of performance you could get out of hardware that is task specific. That's why, for instance, crypto mining went through a very rapid set of iterations: CPU -> GPU -> ASIC in a matter of a few years, with an extremely brief blip of programmable hardware somewhere in there as well (FPGA-based miners, approximately 2013).

Any loss of generality can be traded for a gain in efficiency, and vice versa; the question is whether or not it is economically feasible. Different points along that line have resulted in marketable (and profitable) products, but there are also plenty of wrecks.


1. Thread Switching:
   • Hardware Mechanisms: Specific circuitry used for rapid thread context switching.
   • Context Preservation: How the state of a thread is saved/restored during switches.
   • Overhead: The time and resources required for a thread switch.

2. Memory Latency and Hiding:
   • Prefetching Strategies: Techniques to load data before it's actually needed.
   • Cache Optimization: Adjusting cache behavior for optimal thread performance.
   • Memory Access Patterns: Sequences that reduce wait times and contention.

3. Parallelism:
   • Dependency Analysis: Identifying dependencies that might hinder parallel execution.
   • Fine vs. Coarse Parallelism: Granularity of tasks that can be parallelized.
   • Task Partitioning: Dividing tasks effectively among available threads.

4. Multithreading Models:
   • Static vs. Dynamic Multithreading: Predetermined vs. on-the-fly thread allocation.
   • Simultaneous Multithreading (SMT): Running multiple threads on one core.
   • Hardware Threads vs. Software Threads: Distinguishing threads at the CPU vs. OS level.

5. Data Structures:
   • Lock-free Data Structures: Structures designed for concurrent access without locks.
   • Thread-local Storage: Memory that's specific to a thread.
   • Efficient Queue Designs: Optimizing data structures like queues for thread communication.

6. Synchronization Mechanisms:
   • Barrier Synchronization: Ensuring threads reach a point before proceeding.
   • Atomic Operations: Operations that complete without interruption.
   • Locks: Techniques to avoid common pitfalls like deadlocks and contention.

7. Stall Causes and Mitigation:
   • Instruction Dependencies: Managing data and control dependencies.
   • IO-bound vs. CPU-bound: Balancing IO and CPU workloads.
   • Handling Page Faults: Strategies for minimal disruption during memory page issues.

8. Instruction Pipelining:
   • Pipeline Hazards: Situations that can disrupt the smooth flow in a pipeline.
   • Out-of-Order Execution: Executing instructions out of their original order for efficiency.
   • Branch Prediction: Guessing the outcome of conditional operations to prepare.

Those are some of the complications related to barrel processing; a small illustration of two of the items follows below.
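To make a couple of those items concrete, here is a minimal CPU-side sketch of thread-local storage and barrier synchronization using Python's threading module (nothing barrel-processor-specific; the names and numbers are illustrative):

    import threading

    N_THREADS = 4
    barrier = threading.Barrier(N_THREADS)  # barrier sync: nobody proceeds until all arrive
    tls = threading.local()                 # thread-local storage: private per thread, no locks

    def worker(thread_id: int, results: list) -> None:
        # Each thread computes a private partial sum in its own thread-local slot.
        tls.partial = sum(range(thread_id * 1000, (thread_id + 1) * 1000))
        barrier.wait()                      # wait here until every thread has its partial
        results[thread_id] = tls.partial    # each thread writes only its own index: no contention

    results = [0] * N_THREADS
    threads = [threading.Thread(target=worker, args=(i, results)) for i in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sum(results))                     # 7998000, the sum of 0..3999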


People aren't really doing GPU coding; they're calling CUDA libraries. As with other Intel projects, this needs dev support to survive, and Intel has not been good at providing it. Consider the fate of the Xeon Phi and Omnipath.


Huh, that is a solid point. How big a fraction would you estimate to be actually capable of doing real GPU coding?


I don't think I have a broad enough perspective of the industry to assess that. In my corner of the world (scientific computing), it's probably three quarters simple use of accelerators and one quarter actual architecture designed around the GPU itself.



