The opposite problem can happen - the CEO uses the product all the time and becomes blind to problems. “It has always worked that way” or “who would want to do that!?” are much more common than pure apathy.
They also get massive subsidies and tax breaks for building these data centers. They require the negotiations to be done in secret, and often fight to keep the agreements secret so people don’t flip out when they see how bad the deals are.
Kind of a fun toy problem to play around with. I noticed you had thread coarsening as an option - there is often some gain to be had there. This is also a good case for Nsight - the things hurting your performance aren't always obvious, and it's a pretty good profiler, so it might be worth spending some time in it. (I wrote about a fun interaction between thread coarsening and automatic loop unrolling that I found with Nsight here: https://www.spenceruresk.com/loop-unrolling-gone-bad-e81f66f...)
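For anyone not familiar with the term, thread coarsening is roughly this pattern - each thread handles several elements instead of one, amortizing index math and scheduling overhead. The factor, names, and kernel below are made up for illustration, not taken from the post above:

```cuda
// Illustrative only: a coarsened elementwise kernel. Each thread processes
// COARSEN elements, strided by blockDim.x so neighbouring threads still
// touch neighbouring addresses (keeps accesses coalesced).
#define COARSEN 4

__global__ void scale_coarsened(float *out, const float *in, float alpha, int n)
{
    int base = blockIdx.x * blockDim.x * COARSEN + threadIdx.x;

    #pragma unroll
    for (int i = 0; i < COARSEN; ++i) {
        int idx = base + i * blockDim.x;
        if (idx < n)
            out[idx] = alpha * in[idx];
    }
}
// Launch with roughly (n + blockDim * COARSEN - 1) / (blockDim * COARSEN) blocks.
```

The fixed-trip-count inner loop is also exactly the kind of thing the compiler will happily unroll on its own, which is where coarsening and automatic unrolling start to interact.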
You may also want to look at other sorting algorithms - the usual CPU sorting algorithms don't map well onto GPU hardware, whereas a network sort like bitonic sort does more total work (and you have to pad to a power of two) but often runs much faster on parallel hardware.
I had a fairly naive implementation that would sort 10M elements in around 10ms on an H100. I'm sure with more work it could get quite a bit faster, but the inputs need to be fairly big to make up for the kernel launch overhead.
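For context, the core of a naive global-memory bitonic sort is just a compare-and-swap kernel driven by a host loop over stages. This is a minimal sketch (power-of-two size assumed; a serious implementation would do the inner stages in shared memory), not the exact code I benchmarked:

```cuda
// One compare-and-swap pass of a bitonic sort over global memory.
__global__ void bitonic_step(float *data, int j, int k)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned partner = i ^ j;            // element this thread compares against
    if (partner > i) {
        bool ascending = ((i & k) == 0); // sort direction for this bitonic block
        if ((ascending && data[i] > data[partner]) ||
            (!ascending && data[i] < data[partner])) {
            float tmp = data[i];
            data[i] = data[partner];
            data[partner] = tmp;
        }
    }
}

// Host driver: n must be a power of two and >= 256 for this simple launch config.
void bitonic_sort(float *d_data, int n)
{
    int threads = 256;
    int blocks = n / threads;
    for (int k = 2; k <= n; k <<= 1)
        for (int j = k >> 1; j > 0; j >>= 1)
            bitonic_step<<<blocks, threads>>>(d_data, j, k);
}
```

Every (k, j) pair is a separate kernel launch, which is where the launch overhead mentioned above comes from on small inputs.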
> I'm surprised you're not touting the "save on your power bill" benefits.
At ~$600/kWh for capacity, the ROI isn't great. I have a pretty big differential on my rates because I have an EV, and even then I'd need over a decade to make the $1,000 back assuming I fully discharged it every day.
Are you sure they ditched CUDA? I keep hearing this, but it seems odd - entirely ditching it would be a ton of extra work vs. selectively employing some PTX in CUDA kernels, which is fairly straightforward.
Their paper [1] only mentions using PTX in a few areas to optimize data transfer operations so they don't blow up the L2 cache. This makes intuitive sense to me, since the main limitation of the H800 vs the H100 is reduced NVLink bandwidth, which pushes you toward tricks like this that may not be common for teams with access to H100s.
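To be clear about what "some PTX in a CUDA kernel" usually looks like: it's inline asm for a specific instruction or cache hint inside otherwise normal CUDA C++. This is a generic example with a streaming load hint, not the actual code from the paper:

```cuda
// Illustrative inline PTX: a load with the .cs ("cache streaming") hint,
// telling the hardware the data is likely touched once and shouldn't
// linger in the caches. Everything else stays ordinary CUDA C++.
__device__ float load_streaming(const float *ptr)
{
    float val;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(val) : "l"(ptr));
    return val;
}

__global__ void copy_with_hint(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = load_streaming(src + i);
}
```

(This particular hint is also exposed as the __ldcs() intrinsic; dropping to PTX buys you the cases that aren't wrapped.)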
I've only done the CUDA side (and not professionally), so I've always wondered myself how much those skills transfer either way. I imagine some of the specific techniques are fairly different, but a lot of it is just your mental model for this kind of programming, which can be a bit of a shift if you're not used to it.
I'd think things like optimizing for occupancy and memory throughput, ensuring coalesced memory accesses, tuning block sizes, using fast-math alternatives, writing parallel algorithms, and working with profiling tools like Nsight are fairly transferable?
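As a toy example of the coalescing point - the transferable skill is recognizing access patterns like these, whichever vendor's hardware you're on (kernels and names are made up):

```cuda
// Consecutive threads read consecutive addresses: a warp's loads collapse
// into a few wide memory transactions.
__global__ void row_major_read(float *out, const float *in, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        out[row * width + col] = in[row * width + col];      // coalesced
}

// Consecutive threads stride by `height`: the same warp now generates many
// separate transactions and wastes most of the fetched bytes.
__global__ void column_major_read(float *out, const float *in, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        out[col * height + row] = in[col * height + row];     // uncoalesced
}
```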
So many fond memories of this game - it was a really fun blend of railroad sim and economic sim that I haven't really found since. I'll never forget the "ding ding ding" sound that goes off when a train pulls into a station and earns you a bit of cash!
My non-expert brain immediately jumped to double-pumping, plus maybe working with Thread Director to have tasks that use a lot of AVX-512 instructions prefer P-cores. It feels like such an obvious solution to a really dumb problem that I assumed there was something simple I was missing.
The register file size makes sense - I didn't think the register files took up that much die area on those processors, but I guess they had to be pretty aggressive to meet power goals?
Having dabbled in CUDA, but not worked with it professionally, I feel like a lot of the complexity isn't really in CUDA/C++ itself but in the algorithms you have to come up with to really take advantage of the hardware.
Optimizing something for SIMD execution is often not straightforward, and it isn't something most developers encounter outside a few niche areas. There are also a lot of hardware architecture considerations you have to work with (memory transfer speed is a big one) to even come close to saturating the compute units.
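A typical example of the hardware side: hiding host-to-device transfer time behind compute by chunking the work across streams with pinned host memory. This is just a sketch of the general pattern, with a placeholder kernel and sizes:

```cuda
#include <cuda_runtime.h>

__global__ void some_kernel(float *x, int n)     // placeholder compute
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Overlap copies and compute: while one chunk is being copied, another can run.
// h_in/h_out must be allocated with cudaHostAlloc (pinned) for the async copies
// to actually overlap; n is assumed to divide evenly into kChunks * 256.
void process_in_chunks(float *h_in, float *h_out, float *d_buf, int n)
{
    const int kChunks = 4;
    int chunk = n / kChunks;
    cudaStream_t streams[kChunks];

    for (int c = 0; c < kChunks; ++c) {
        cudaStreamCreate(&streams[c]);
        int offset = c * chunk;
        cudaMemcpyAsync(d_buf + offset, h_in + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        some_kernel<<<chunk / 256, 256, 0, streams[c]>>>(d_buf + offset, chunk);
        cudaMemcpyAsync(h_out + offset, d_buf + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    for (int c = 0; c < kChunks; ++c) {
        cudaStreamSynchronize(streams[c]);
        cudaStreamDestroy(streams[c]);
    }
}
```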
Kind of funny - I was just looking at adding similar functionality to an internal Chrome plugin I built, because we struggle to get enough useful info in bug reports (the HAR in particular is valuable, but getting users to capture it correctly is difficult).
Two questions -
1) Any way to customize which headers/cookies get scrubbed?
2) Is there a way to get something like the lower-level export you can get by going to chrome://net-export/?
This would have been so useful this past week while we debugged something that turned out to be a weird combination of Comcast, Cloudflare, and HTTP/3 - only some people could reliably reproduce it, and coaching them through all the capture steps was a lot of work.
Being able to redact a few values would be really important for us (right now I just run Python scripts to clean up the HAR files), but I’m going to play around with it this weekend.