Always enjoy these articles. I do think that the simulation strategy deserves a bit more discussion, and the pitfalls of replays are glossed over slightly.
Simulating players gives you the benefit of end-to-end testing much more easily (for example, it exercises your login and disconnect flows, both of which are often written without much care for performance). It also practically demands that you write a scripting console for your game, which is an invaluable resource for error testing, debugging, and development by both your engine and content developers. It's true that human players are more unpredictable than bots, but that's not a drawback, because you want to stress-test the hottest paths of your game. For that purpose, you can write a small test level (think an arena) and have bots spawn at either end and run towards each other, attacking until they die. This isn't at all representative of how players will play the game, but it will very definitely catch performance regressions.
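A minimal sketch of that arena idea, assuming a hypothetical scripting interface (the `Bot` class and `run_wave` names are illustrative, not from any real engine):

```python
import random

ARENA_LENGTH = 100

class Bot:
    """A scripted combatant that walks toward the nearest enemy and attacks."""
    def __init__(self, team, pos):
        self.team, self.pos, self.hp = team, pos, 100

    def step(self, enemies):
        target = min(enemies, key=lambda e: abs(e.pos - self.pos))
        if abs(target.pos - self.pos) > 1:
            # out of range: close the distance one unit per tick
            self.pos += 1 if target.pos > self.pos else -1
        else:
            # in range: attack until somebody dies
            target.hp -= random.randint(5, 15)

def run_wave(n_per_side=10, max_ticks=10_000):
    """Spawn two teams at opposite ends; return the tick the wave ended on."""
    red  = [Bot("red", 0) for _ in range(n_per_side)]
    blue = [Bot("blue", ARENA_LENGTH) for _ in range(n_per_side)]
    for tick in range(max_ticks):
        red  = [b for b in red  if b.hp > 0]
        blue = [b for b in blue if b.hp > 0]
        if not red or not blue:
            return tick      # wave over; a load test would respawn and repeat
        for b in red:
            b.step(blue)
        for b in blue:
            b.step(red)
    return max_ticks

print("wave finished after", run_wave(), "ticks")
```

Looping `run_wave` forever on many clients is the whole load test: unrealistic play, but a steady, repeatable hammer on the combat path.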
Replays depend on two factors: deterministic playback and unchanging content. Deterministic playback is possible (though extremely difficult) to achieve, but it has to be a priority from the beginning, it needs its own testing suite, and fixing determinism regressions has to stay a priority. It's impossible to retrofit onto an existing game (sometimes by design, as non-determinism is occasionally mistaken for randomness). Unchanging content is, obviously, unlikely during development and much more achievable during betas, but by then you also have more programmer and content-developer time to devote to ancillary features like expanding bot programming.
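The shape of a determinism regression test is simple: run the same simulation twice from the same seed and compare state digests tick by tick. A toy sketch (the tiny "simulation" here is a stand-in; the harness shape is the point):

```python
import hashlib
import random

def simulate(seed, ticks=100):
    """Run a toy simulation; all randomness flows from one seeded RNG."""
    rng = random.Random(seed)
    state = [0] * 8
    digests = []
    for _ in range(ticks):
        i = rng.randrange(len(state))
        state[i] += rng.randint(1, 6)
        # digest the full state every tick so a divergence is caught early,
        # not just at the end of the run
        digests.append(hashlib.sha256(repr(state).encode()).hexdigest())
    return digests

assert simulate(42) == simulate(42), "same seed must replay identically"
print("deterministic over", len(simulate(42)), "ticks")
```

Any hidden non-determinism (an unseeded RNG, wall-clock reads, thread races) shows up as the first tick whose digests disagree.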
The upside for the simulation problem is that many multiplayer games have something approaching bot AI to begin with - enemy AI - so you can use that as a starting point for those sorts of tests. I'm not sure if that was ever done at scale for GW testing, but it would have been possible...
My understanding is that the replays were possible because the game engine was architected from the beginning with the server 100% in control of game events and the clients relying on message-passing to communicate with the server. This means that the recording captured on the server is a definitive recording of what actually happened in the game. Clients could desync, so a replay might not reproduce that, but otherwise they were pretty accurate.
I'm not sure how the unchanging content problem was solved. I think everything was revision-controlled aggressively enough that it would have been possible to set up a local server instance using old content, and it's also possible that most ordinary changes (altering a texture, a map, etc) wouldn't actually change the results of a replay. Definitely a challenge, though.
Replays came in handy later on for other features (like letting players watch others' PvP matches in-game) so they probably ended up carrying their weight in terms of development effort. I do wonder if they felt like a burden initially, though, before their value became clear...
Dealing with the many scenarios that could break replays was problematic, but being able to run several instances of the same game at the same time and compare them occasionally for desyncs can help alleviate that. For example, using a pointer value (a memory address) as a hash-table or sort key means that, depending on allocation order, iteration over game entities may differ from one run to another.
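The pointer-as-key hazard is easy to illustrate: ordering entities by their memory address (`id()` in CPython) depends on allocation order, so two runs of the "same" game can iterate entities differently and desync. Sorting by a stable, game-assigned ID restores determinism. A small sketch (names are illustrative):

```python
class Entity:
    _next_id = 0
    def __init__(self, name):
        self.name = name
        # stable key, assigned in creation order: identical in every run
        self.entity_id = Entity._next_id
        Entity._next_id += 1

entities = [Entity(n) for n in ("orc", "elf", "dwarf")]

# BAD: order depends on where the allocator happened to place each object
unstable = sorted(entities, key=id)

# GOOD: order depends only on game state
stable = sorted(entities, key=lambda e: e.entity_id)

print([e.name for e in stable])   # always ['orc', 'elf', 'dwarf']
```

Within a single process the `id()` order is fixed, which is exactly why the bug hides: only a second instance of the game, compared against the first, reveals it.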
I disagree with him on bots, though. CCP has a highly evolved bot system which is tied into their build system, so that unit tests can include player behavior. This isn't directly relevant to load testing (although CCP does that with bots as well), but it certainly demonstrates why bots are worth the effort.
And if you've got a full-fledged QA department, bot writing is a good QA engineer task. This is also a decent way to give ambitious QA guys a coding task that doesn't have the potential to affect customer-facing game play.
Without violating NDAs I can say I've worked at MMO companies that found bots invaluable for load testing. Replays will definitely catch things that bots won't, but if you're thinking about how many players you can fit in a zone, etc., etc. bots are great.
Pat's speaking for himself here, of course, as evidenced by the fact that bots were crucial for load testing GW2 (they even ran on Amazon EC2, so they could scale up dramatically).
A very interesting read, but I'm missing one step:
Game recording is established as a favorable way to make games better, to avoid a crappy beta experience or problems right after launch.
How are you going to get decent recordings in the first place? Playing with a couple of developers seems a lousy way to jump-start a good selection of games, doesn't it? How is this chicken-and-egg problem solved? I'm missing a way to bootstrap the process and load a good set of (varied, representative) replays.
This is pretty much exactly how it worked when I was there. I imagine when the team was smaller, they leveraged a combination of developer recordings + recordings from volunteer testers (the game had a group of volunteer testers in the high hundreds by the time I started working on it).
In particular QA recordings got used a lot to reproduce particularly tricky bugs and race conditions, since players usually weren't savvy enough to spot them.
The volunteer tester group was also useful for load testing, since it gave us a way to throw a hundred or so players at one of our development servers and record all of them simultaneously. It gave a reasonable approximation of what the load might look like on a single production server and it meant we could also capture recordings of how the servers (and clients) behaved under load instead of just having 4-8 developers playing on them.
QA testers could write recordings out, sure, but taking a recording from a crash and playing it back locally to 1. inspect the crash with full debugging powers (stepping through code, seeing the full heap, etc.) and 2. fix the crash and re-run the recording to verify the fix corrected the problem, was both a more common and a more valuable use of the recordings.
Any comments on the infrastructure for capturing the replays? Did everything get recorded? I'm used to recording all actions, ability uses, etc. but recording movement isn't something I've seen done all that often and it seems like it'd be another order of magnitude of data.
Think of a web server with a fresh SQL database with no data (other than whatever is pre-populated by the installation). Simply record all incoming messages to the server over a period of time and save them to disk. Then, when you want to replay that recording, start with a fresh database again and replay all the messages. Assuming your system is deterministic (no race conditions, and generally no reliance on a wall clock [there are tricks for this, since it's a harder problem[1]]), you're good to go. The key is designing everything to support this from the beginning.
[1] Done right, you can play a minutes long recording back in seconds, which is very useful for trying to fix a bug and verifying it against a known-bad replay.
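A sketch of that message-log record/replay idea, under the assumptions above (deterministic handlers, no wall-clock reads). The server appends every incoming message with its simulation tick; replay feeds the same log into a fresh state with no sleeping between ticks, which is why a minutes-long recording can play back in seconds. All names here are illustrative:

```python
import json

def apply(state, msg):
    # stand-in for real message handling: deterministic, no wall clock
    if msg["op"] == "deposit":
        state[msg["account"]] = state.get(msg["account"], 0) + msg["amount"]

def record(messages, path="session.log"):
    """Append each incoming message, tagged with its tick, to a log file."""
    with open(path, "w") as f:
        for tick, msg in enumerate(messages):
            f.write(json.dumps({"tick": tick, **msg}) + "\n")

def replay(path="session.log"):
    """Start from a fresh database-equivalent and re-apply every message."""
    state = {}
    with open(path) as f:
        for line in f:
            apply(state, json.loads(line))   # no delay between ticks
    return state

record([{"op": "deposit", "account": "alice", "amount": 5},
        {"op": "deposit", "account": "alice", "amount": 7}])
print(replay())   # {'alice': 12}
```

If any handler ever reads the real clock or a shared RNG, the replayed state diverges from the live one; the recorded tick is the only notion of time the handlers should see.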
I was confused about the part where he mentions a rack full of 1Us was generating too much heat and the solution was to put fewer servers in a rack. I can't think of a situation where there is nothing at fault besides the servers. The chillers should be able to keep up with 1Us like that, and it's likely there was something else going on.
The situation we're running into at my job is with double-density servers (e.g. Dell C6105s). The problem we're having is those double-density servers overheating. Maybe somebody with more experience than me can explain why cooling 1Us would ever be a legitimate problem?
"I can't think of a situation where there is nothing at fault besides the servers. The chillers should be able to keep up with 1Us like that, and it's likely there was something else going on.
...
Maybe somebody with more experience than me can explain why cooling 1Us would ever be a legitimate problem?"
The situation in server labs, hardware and infrastructure-wise was much worse seven years ago. Is it that hard to imagine?
I was in high school 7 years ago so yes, it is hard to imagine because I really don't know how things used to be. I do appreciate the condescending tone though.
Depends on the datacenter. Any modern datacenter won't have an issue. However, if you didn't do your research and wound up in something that wasn't built to current specs, I can easily imagine a 42U rack overloading.
Similar example: there's a tier 1 network provider with a solid datacenter business; I had servers in one of their facilities during the period in which blade servers were becoming common. They literally ran out of power; we had some empty racks with a contract allowing us to turn on power for them at any point and they couldn't fulfill it.
Looking back on the specs for 80s and 90s datacenters can be really amusing in retrospect.
I can think of at least one MMO that had to buy 50% more floor space in their cage because their blade server deployment exceeded what the colo provider could supply in power/cooling per square foot. They literally had to leave an empty space to the left and right of each rack.
DCs have been constrained on power and cooling for the last decade or so. Existing colo sites are most likely to have 5 kVA (4 kW) rack positions. That's only about 90 watts/RU. Newer facilities will have 10 kVA racks. Subtract a couple RU for cabling and the ToR switch and the power budget is about 250 watts/RU.
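The per-RU figures above are just division. A quick sketch under the commenter's apparent assumptions (a 42U rack, and a ~0.8 power factor turning 5 kVA into 4 kW; note the 250 W/RU figure for the 10 kVA rack only works out if you count the full 10 kVA as watts):

```python
def watts_per_ru(kva, power_factor=0.8, usable_ru=42):
    """Real power available per rack unit, given apparent power in kVA."""
    return kva * 1000 * power_factor / usable_ru

print(round(watts_per_ru(5)))                  # ~95 W/RU for a 5 kVA, 42U rack
print(round(watts_per_ru(10, usable_ru=40)))   # ~200 W/RU with 2U lost to cabling/ToR
```

Either way, a rack of 1U servers drawing a few hundred watts each blows through those budgets long before the rack is physically full, which is why half-empty racks happen.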