Given the effort it takes to play a game and the inter-subject noise in a numerical judgement, I think a better use of Mechanical Turk would be to ask the player to do some blame assignment. That is, instead of rating a game as 4 stars, they could give a thumbs up/down to particular rules or sets of rules in an abstracted representation of the game.
This kind of feedback extracts several more bits of information from the player than a single rating does (making better use of their time). However, it breaks the applicability of an evolutionary algorithm that treats both artifacts and fitness evaluation as black boxes. If you used a search algorithm that was aware of how the game was built from components, I'd guess that component-level feedback (being both more objective and more specific) could provide more informative selection pressure than the standard interactive genetic algorithm setup gets.
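To make that concrete, here's a minimal Python sketch of per-component feedback driving a component-aware revision step instead of a black-box fitness value. The names and data structures are my own illustration, not anything from the article:

```python
import random
from collections import defaultdict

# Illustrative sketch (my own names, not from the article): each candidate game
# is a set of named rule components, and each playtest returns thumbs up/down
# votes on individual components rather than one star rating for the whole game.

def aggregate_component_feedback(playtests):
    """playtests: iterable of dicts mapping component name -> +1 (up) or -1 (down).
    Returns a dict mapping component name -> net score."""
    scores = defaultdict(int)
    for votes in playtests:
        for component, vote in votes.items():
            scores[component] += vote
    return scores

def propose_revision(ruleset, component_scores, component_pool):
    """Component-aware search step: swap out the worst-scoring component for an
    unused one from the pool, instead of mutating the game as a black box."""
    worst = min(ruleset, key=lambda c: component_scores.get(c, 0))
    unused = [c for c in component_pool if c not in ruleset]
    revised = set(ruleset) - {worst}
    if unused:
        revised.add(random.choice(unused))
    return revised
```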
Here's my restatement of the statistical approach (correct me if I'm wrong):
Instead of proposing some metric for fun that is valid a priori, we are going to look for very grounded, game-specific correlates of fun and optimize with respect to those. Satisfactorily capturing the entire fuzzy sense of human fun in a finite piece of code is silly, but cataloging the observed feedback for every popular combination of level elements (e.g. "goomba two blocks to the right of a coin block while the player has a fire flower") is both practical and useful for automating level design.
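A minimal sketch of that cataloging idea, with a made-up pattern encoding and feedback values, might look like this:

```python
from collections import defaultdict

# Hypothetical sketch: catalog player feedback keyed by small, concrete
# combinations of level elements, rather than by a global "fun" score.

feedback_catalog = defaultdict(list)

def record_feedback(pattern, rating):
    """pattern: a hashable description of a local level-element combination,
    e.g. ('goomba', 'right_of', 'coin_block', 'player_has_fire_flower').
    rating: observed player feedback for a level containing that pattern."""
    feedback_catalog[pattern].append(rating)

def pattern_score(pattern):
    """Grounded, game-specific correlate of fun: the mean observed feedback
    for this exact combination of elements (None if never observed)."""
    ratings = feedback_catalog[pattern]
    return sum(ratings) / len(ratings) if ratings else None

def score_level(patterns):
    """Score a candidate level by the patterns it contains; unseen patterns
    contribute nothing, so the optimizer is pushed toward known-good combos."""
    known = [pattern_score(p) for p in patterns if feedback_catalog[p]]
    return sum(known) / len(known) if known else 0.0
```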
It's natural to question how someone can unproblematically pose a metric on the value of game rulesets (particularly without a generous helping of philosophy and psychology to back it up). However, pragmatically, it's not a silly thing to try. If there were game-design IDEs in the future, we'd like them to provide the equivalent of spell/grammar check -- if there were an alternate design an edit distance of two away from the current design that scored dramatically better on some canned metric, it might be worth putting a human eye on that alternative.
If you think of the error function as the combined output of a bunch of "obvious flaw" detectors instead of as a "theory of fun", then speculative optimization seeded with the design a human currently has under consideration becomes a potentially interesting bit of design automation. Think of design rule checks in CAD with a bit of fuzziness and a default bias -- instead of just saying yes/no, it can say "X might be a better alternative according to the metrics you've enabled in the preference window; consider adopting some of its edits".
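Here's a hedged sketch of that spell-check-for-designs idea; the detectors, margin, and design representation are stand-ins I invented for illustration:

```python
# Hedged sketch of "design spell check": a bag of obvious-flaw detectors,
# each returning a penalty, plus a scan over small edits of the current design.

def unreachable_goal_penalty(design):
    # Toy stand-in: penalize designs that declare no win condition at all.
    return 0.0 if design.get("win_condition") else 10.0

def degenerate_strategy_penalty(design):
    # Toy stand-in: penalize designs where one action's payoff dwarfs the average.
    payoffs = design.get("action_payoffs", [])
    if not payoffs:
        return 0.0
    return 5.0 if max(payoffs) > 2 * (sum(payoffs) / len(payoffs)) else 0.0

FLAW_DETECTORS = [unreachable_goal_penalty, degenerate_strategy_penalty]

def error(design, enabled=FLAW_DETECTORS):
    """Combined output of the enabled flaw detectors -- not a theory of fun."""
    return sum(detector(design) for detector in enabled)

def design_check(current_design, neighbors):
    """neighbors: designs within a small edit distance of the current design.
    Returns an alternative worth a human look if it scores dramatically better."""
    best = min(neighbors, key=error)
    if error(best) < error(current_design) - 5.0:  # arbitrary "dramatically better" margin
        return best
    return None
```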
I'm in grad school for AI right now as a result of reading about Eurisko as a kid. Though I haven't followed up on the greater vision, my thesis proposal a few years ago pitched the project of building a Eurisko-style discovery system for game design: http://users.soe.ucsc.edu/~amsmith/proposal/amsmith-proposal...
Also, the fact that I had a good experience playing an "academic" class character in Traveller during college was an influence in deciding to go to grad school. I haven't seen this class in any other RPG.
Before you jump to evolutionary algorithms, note "The Problem with Evolution" section in the OP. It was written by a guy who knows what he's doing with evolutionary methods, and it's interesting to see him turning to MCTS given his previous success with evolution.
The gist is that games are fragile with respect to mutations (motivating the need for a search paradigm not based on gradualism). This fragility may not exhibit itself as strongly for the parameter tweaking involved in the balancing problem, but balancing quickly shades into general game design when you start to consider non-trivial tweaks.
The project I was thinking of was going to be a tower defense game, and I wasn't planning on trying to develop any rules automatically. Rather, I was thinking in terms of using an evolutionary system to pick weapon and enemy strengths that lead to balanced gameplay.
So, I would probably get away with it, but I'm wide open to different ideas.
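For what I had in mind, a bare-bones evolutionary loop over strength parameters might look something like the following sketch. The simulation and fitness are stubs I made up; a real version would plug in the actual tower defense simulation:

```python
import random

# Hypothetical sketch: evolve weapon/enemy strength vectors toward a balance
# target. Everything below is a toy stand-in for the real game simulation.

N_WEAPONS, N_ENEMIES = 4, 4

def random_params():
    return [random.uniform(0.5, 2.0) for _ in range(N_WEAPONS + N_ENEMIES)]

def simulate_win_rate(params):
    # Stub: in practice, run many simulated waves and return the player's win rate.
    weapons, enemies = params[:N_WEAPONS], params[N_WEAPONS:]
    return sum(weapons) / (sum(weapons) + sum(enemies))

def fitness(params, target=0.55):
    # "Balanced" here means the win rate lands near a chosen target (e.g. 55%).
    return -abs(simulate_win_rate(params) - target)

def evolve(generations=200, pop_size=30, sigma=0.1):
    pop = [random_params() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = [
            [max(0.1, g + random.gauss(0, sigma)) for g in random.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=fitness)
```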
You might enjoy answer set programming (ASP). It's a different take on logic programming based on the idea of automatically transforming your problem into a SAT-like representation and then running fancy conflict-driven constraint learning solvers on it. In many solvers, you give up Turing completeness in exchange for being really productive with solving NP-complete problems (similar to regex engines turning you into a bad-ass string matching programmer).
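For a taste of the style, here's a tiny graph-coloring program in clingo-flavored ASP, driven from the clingo Python bindings (assuming those are installed; the problem instance is made up):

```python
import clingo  # assumes the clingo Python bindings are installed (pip install clingo)

# Declare the problem as facts, a choice rule, and a constraint,
# then let the conflict-driven solver do the search.
PROGRAM = """
node(1..4).
edge(1,2). edge(2,3). edge(3,4). edge(4,1).
color(red;green;blue).

% each node gets exactly one color
1 { assign(N,C) : color(C) } 1 :- node(N).

% adjacent nodes may not share a color
:- edge(X,Y), assign(X,C), assign(Y,C).
"""

ctl = clingo.Control()
ctl.add("base", [], PROGRAM)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda model: print(model))  # prints one satisfying assignment
```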
I had the core idea of this in 2006 (binary coded patterns of small magnets for position and orientation sensitivity), but... didn't follow up. Looks like they are doing some interesting stuff with it (and piling on the patents).
The original application I imagined was very simple arrays for use in kids' tile puzzles, where you could have square-peg-in-square-hole style thinking but without hole geometry (choosing something more semantic instead). Puzzle tiles would have strong attraction only to their correct location, and all other locations would be relatively neutral.
If so, then it's all the more true that now is the time to invest attention in learning how the internet works at all of its layers. When the time comes, you can be part of those who build the next internet-like thing with full knowledge of all the weaknesses of the previous one.