The opposite problem can happen - the CEO uses the product all the time and becomes blind to problems. “It has always worked that way” or “who would want to do that!?” are much more common than pure apathy.
They also get massive subsidies and tax breaks for building these data centers. They require the negotiations to be done in secret, and often fight to keep the agreements secret so people don’t flip out when they see how bad the deals are.
Kind of a fun toy problem to play around with. I noticed you had thread coarsening as an option - there is often some gain to be had there. This is also a good case for Nsight - the things hurting your performance aren't always obvious, and it's a pretty good profiler, so it might be worth spending some time in it. (I wrote about a fun interaction between thread coarsening and automatic loop unrolling that I found with Nsight here: https://www.spenceruresk.com/loop-unrolling-gone-bad-e81f66f...)
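For anyone not familiar with the term, thread coarsening is roughly this pattern - each thread handles several elements instead of one, amortizing index math and scheduling overhead. The factor, names, and kernel below are made up for illustration, not taken from the post above:

```cuda
// Illustrative only: a coarsened elementwise kernel. Each thread processes
// COARSEN elements, strided by blockDim.x so neighbouring threads still
// touch neighbouring addresses (keeps accesses coalesced).
#define COARSEN 4

__global__ void scale_coarsened(float *out, const float *in, float alpha, int n)
{
    int base = blockIdx.x * blockDim.x * COARSEN + threadIdx.x;

    #pragma unroll
    for (int i = 0; i < COARSEN; ++i) {
        int idx = base + i * blockDim.x;
        if (idx < n)
            out[idx] = alpha * in[idx];
    }
}
// Launch with roughly (n + blockDim * COARSEN - 1) / (blockDim * COARSEN) blocks.
```

The fixed-trip-count inner loop is also exactly the kind of thing the compiler will happily unroll on its own, which is where coarsening and automatic unrolling start to interact.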
You may also want to look at other sorting algorithms - the usual CPU sorting algorithms don't map well onto GPU hardware, whereas a network sort like bitonic sort does more total work (and you have to pad to a power of two) but often runs much faster on parallel hardware.
I had a fairly naive implementation that would sort 10M elements in around 10ms on an H100. I'm sure with more work it could get quite a bit faster, but the inputs need to be fairly big to make up for the kernel launch overhead.
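For context, the core of a naive global-memory bitonic sort is just a compare-and-swap kernel driven by a host loop over stages. This is a minimal sketch (power-of-two size assumed; a serious implementation would do the inner stages in shared memory), not the exact code I benchmarked:

```cuda
// One compare-and-swap pass of a bitonic sort over global memory.
__global__ void bitonic_step(float *data, int j, int k)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned partner = i ^ j;            // element this thread compares against
    if (partner > i) {
        bool ascending = ((i & k) == 0); // sort direction for this bitonic block
        if ((ascending && data[i] > data[partner]) ||
            (!ascending && data[i] < data[partner])) {
            float tmp = data[i];
            data[i] = data[partner];
            data[partner] = tmp;
        }
    }
}

// Host driver: n must be a power of two and >= 256 for this simple launch config.
void bitonic_sort(float *d_data, int n)
{
    int threads = 256;
    int blocks = n / threads;
    for (int k = 2; k <= n; k <<= 1)
        for (int j = k >> 1; j > 0; j >>= 1)
            bitonic_step<<<blocks, threads>>>(d_data, j, k);
}
```

Every (k, j) pair is a separate kernel launch, which is where the launch overhead mentioned above comes from on small inputs.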
> I'm surprised you're not touting the "save on your power bill" benefits.
At ~$600/kWh for capacity, the ROI isn't great. I have a pretty big differential on my rates because I have an EV, and even then I'd need over a decade to make the $1,000 back assuming I fully discharged it every day.
Are you sure they ditched CUDA? I keep hearing this, but it seems odd - entirely ditching it would be a ton of extra work vs. selectively employing some PTX in CUDA kernels, which is fairly straightforward.
Their paper [1] only mentions using PTX in a few areas to optimize data transfer operations so they don't blow up the L2 cache. This makes intuitive sense to me, since the main limitation of the H800 vs the H100 is reduced NVLink bandwidth, which pushes you toward tricks like this that may not be common for teams with access to H100s.
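To be clear about what "some PTX in a CUDA kernel" usually looks like: it's inline asm for a specific instruction or cache hint inside otherwise normal CUDA C++. This is a generic example with a streaming load hint, not the actual code from the paper:

```cuda
// Illustrative inline PTX: a load with the .cs ("cache streaming") hint,
// telling the hardware the data is likely touched once and shouldn't
// linger in the caches. Everything else stays ordinary CUDA C++.
__device__ float load_streaming(const float *ptr)
{
    float val;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(val) : "l"(ptr));
    return val;
}

__global__ void copy_with_hint(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = load_streaming(src + i);
}
```

(This particular hint is also exposed as the __ldcs() intrinsic; dropping to PTX buys you the cases that aren't wrapped.)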
I've only done the CUDA side (and not professionally), so I've always wondered myself how much those skills transfer either way. I imagine some of the specific techniques are fairly different, but a lot of it is just your mental model for this kind of programming, which can be a bit of a shift if you're not used to it.
I'd think things like optimizing for occupancy and memory throughput, ensuring coalesced memory accesses, tuning block sizes, using fast-math alternatives, writing parallel algorithms, and working with profiling tools like Nsight are fairly transferable?
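As a toy example of the coalescing point - the transferable skill is recognizing access patterns like these, whichever vendor's hardware you're on (kernels and names are made up):

```cuda
// Consecutive threads read consecutive addresses: a warp's loads collapse
// into a few wide memory transactions.
__global__ void row_major_read(float *out, const float *in, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        out[row * width + col] = in[row * width + col];      // coalesced
}

// Consecutive threads stride by `height`: the same warp now generates many
// separate transactions and wastes most of the fetched bytes.
__global__ void column_major_read(float *out, const float *in, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        out[col * height + row] = in[col * height + row];     // uncoalesced
}
```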
So many fond memories of this game - it was a really fun blend of railroad sim and economic sim that I haven't really found since. I'll never forget the "ding ding ding" sound that goes off when a train pulls into a station and earns you a bit of cash!
My non-expert brain immediately jumped to double-pumping, plus maybe working with Thread Director to have tasks that use a lot of AVX-512 instructions prefer P-cores. It feels like such an obvious solution to a really dumb problem that I assumed there was something simple I was missing.
The register file size makes sense - I didn't think the register files took up that much die area on those processors, but I guess they had to be pretty aggressive to meet power goals?
Having dabbled in CUDA, but not worked with it professionally, I feel like a lot of the complexity isn't really in CUDA/C++ itself but in the algorithms you have to come up with to really take advantage of the hardware.
Optimizing something for SIMD execution is often not straightforward, and it isn't something most developers encounter outside a few niche areas. There are also a lot of hardware architecture considerations you have to work with (memory transfer speed is a big one) to even come close to saturating the compute units.
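A typical example of the hardware side: hiding host-to-device transfer time behind compute by chunking the work across streams with pinned host memory. This is just a sketch of the general pattern, with a placeholder kernel and sizes:

```cuda
#include <cuda_runtime.h>

__global__ void some_kernel(float *x, int n)     // placeholder compute
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Overlap copies and compute: while one chunk is being copied, another can run.
// h_in/h_out must be allocated with cudaHostAlloc (pinned) for the async copies
// to actually overlap; n is assumed to divide evenly into kChunks * 256.
void process_in_chunks(float *h_in, float *h_out, float *d_buf, int n)
{
    const int kChunks = 4;
    int chunk = n / kChunks;
    cudaStream_t streams[kChunks];

    for (int c = 0; c < kChunks; ++c) {
        cudaStreamCreate(&streams[c]);
        int offset = c * chunk;
        cudaMemcpyAsync(d_buf + offset, h_in + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        some_kernel<<<chunk / 256, 256, 0, streams[c]>>>(d_buf + offset, chunk);
        cudaMemcpyAsync(h_out + offset, d_buf + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    for (int c = 0; c < kChunks; ++c) {
        cudaStreamSynchronize(streams[c]);
        cudaStreamDestroy(streams[c]);
    }
}
```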
Kind of funny - I was just looking at adding similar functionality to an internal Chrome plugin I built, because we struggle to get enough useful info in bug reports (the HAR in particular is valuable, but getting users to capture it correctly is difficult).
Two questions -
1) Any way to customize which headers/cookies get scrubbed?
2) Is there a way to get something like the lower-level export you can get by going to chrome://net-export/?
This would have been so useful this past week while we debugged something that turned out to be a weird combination of Comcast, Cloudflare, and HTTP/3 - only some people could reliably reproduce it, and coaching them through all the capture steps was a lot of work.
Being able to redact a few values would be really important for us (right now I just run Python scripts to clean up the HAR files), but I’m going to play around with it this weekend.