Given enough money, you can outsource the error function to Mechanical Turk. Or, to go more meta (and possibly cheaper): make designing a good error function part of a meta GA run, and judge those candidate functions by how well their judgements agree with human judgements.
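Concretely, the meta-fitness of an evolved error function could be something like its rank agreement with human ratings on a shared sample of games. Here is a minimal sketch of that idea; candidate_error_fn, sample_games, and human_ratings are hypothetical stand-ins, not part of any existing system:

```python
from scipy.stats import spearmanr

def meta_fitness(candidate_error_fn, sample_games, human_ratings):
    """Score an evolved error function by how well its judgements
    rank-correlate with human ratings on the same sample of games."""
    machine_scores = [candidate_error_fn(game) for game in sample_games]
    rho, _p_value = spearmanr(machine_scores, human_ratings)
    return rho  # higher = closer agreement with human judgement
```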
Given the effort it takes to play a game and the inter-subject noise in a numerical judgement, I think a better use of Mechanical Turk would be to ask the player to do some blame assignment. That is, instead of rating a game as 4 stars, they could give a thumbs up/down to particular rules or sets of rules in an abstracted representation of the game.
This kind of feedback extracts several more bits of information from the player than a single rating does, making better use of their effort. However, it breaks the applicability of an evolutionary algorithm that treats both artifacts and fitness evaluation as black boxes. If you instead used a search algorithm that is aware of how the game is built from components, I'd guess that component-level feedback (being both more objective and more specific) could provide more informative selection pressure than the single scalar rating a standard interactive genetic algorithm works with.
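As a rough illustration of what component-level feedback could buy such a search, here is a hedged sketch in which per-rule thumbs up/down scores bias mutation toward replacing disliked rules. The rule pool, the feedback format, and mutate_with_blame are all invented for the example:

```python
import random

# Hypothetical stand-ins: a real system would draw rules from the game's
# rule grammar and aggregate the votes players actually submitted.
RULE_POOL = ["capture-by-jump", "wrap-around-board", "sudden-death", "piece-promotion"]

def random_rule():
    return random.choice(RULE_POOL)

def mutate_with_blame(ruleset, feedback, base_rate=0.2):
    """Bias mutation toward replacing rules that players voted down.

    `feedback` maps a rule to a score in [-1, 1] aggregated from thumbs
    up/down votes; unrated rules default to 0 (neutral)."""
    child = []
    for rule in ruleset:
        score = feedback.get(rule, 0.0)
        # Disliked rules (score near -1) are replaced at roughly twice the
        # base rate; liked rules (score near +1) are almost never touched.
        if random.random() < base_rate * (1.0 - score):
            child.append(random_rule())
        else:
            child.append(rule)
    return child
```

The point of the sketch is only that the blame signal lands on the component it was given for, rather than being smeared across the whole genome the way a single star rating is.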
Why not go full Turk? Have a Turker suggest a new rule or modification. The suggestion is paid for only if it is accepted by a second Turker, who must play N recorded moves of the game with a friend before passing judgment. Iterate.
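To make the loop concrete, here is a toy sketch of that suggest/review cycle. The Turker behaviour is simulated with random stubs, and nothing here corresponds to a real Mechanical Turk API:

```python
import random

def suggest_rule_change(game):
    """Stand-in for Turker A proposing a modification to one rule."""
    return f"tweak rule {random.randrange(len(game['rules']))}"

def review_by_playing(game, proposal, n_moves):
    """Stand-in for Turker B playing n_moves recorded moves with a friend,
    then accepting or rejecting the proposal (coin flip in this sketch)."""
    return random.random() < 0.5

def full_turk(game, iterations=10, n_moves=20, bounty=0.50):
    payouts = 0.0
    for _ in range(iterations):
        proposal = suggest_rule_change(game)
        if review_by_playing(game, proposal, n_moves):
            payouts += bounty                     # proposer is paid only on acceptance
            game["rules"].append(proposal)        # accepted change carries into the next round
    return game, payouts

game, cost = full_turk({"rules": ["move", "capture", "win-condition"]})
```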