Given the effort the play a game and the inter-subject noise in a numerical judgement, I think a better use of Mechanical Turk would be to ask the player to do some blame assignment. That is, instead of rating a game as 4-stars, they could perhaps give a thumbs up/down to particular rules or sets of rules in abstracted representation of the game.
This kind of feedback extracts several more bits of information from the player than a single rating (making better use of them). However, it breaks the applicability of an evolutionary algorithm which treats both artifacts and fitness evaluation as black boxes. If you use a search algorithm that was aware of how the game was built from components, I'm guessing that component-level feedback (being both more objective and more specific) could provide more informative pressure to drive the algorithm than the standard interactive genetic algorithm setup gets.
This kind of feedback extracts several more bits of information from the player than a single rating (making better use of them). However, it breaks the applicability of an evolutionary algorithm which treats both artifacts and fitness evaluation as black boxes. If you use a search algorithm that was aware of how the game was built from components, I'm guessing that component-level feedback (being both more objective and more specific) could provide more informative pressure to drive the algorithm than the standard interactive genetic algorithm setup gets.