We (the PLASMA research group at UMass, http://plasma.cs.umass.edu) developed a system called AutoMan, designed to automatically manage quality (and to automatically compute pay and time) for a wide variety of tasks. You basically invoke people as functions and it just works, with statistical guarantees; payment and the like are handled without any additional effort. It makes dealing with MTurk much nicer. It's best used from Scala but can also be used from Java.
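To make "people as functions, with statistical guarantees" a bit more concrete, here is a rough sketch of the general idea in Python. This is not AutoMan's API or its exact algorithm. The names `ask_crowd`, `ask_worker`, and `chance_of_agreement` are hypothetical, the worker is just a callable standing in for a paid MTurk task, and the stopping rule is a simplified assumption: keep collecting answers until the observed agreement would be unlikely if every worker were guessing uniformly at random. The sketch also ignores payment, timeouts, and any correction for testing repeatedly as votes arrive.

```python
import random
from collections import Counter


def chance_of_agreement(n_votes, n_options, top_count, trials=20_000, seed=0):
    """Monte Carlo estimate of the probability that, with n_votes workers
    guessing uniformly among n_options, the most popular option still
    receives at least top_count votes."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(n_options) for _ in range(n_votes))
        if max(counts.values()) >= top_count:
            hits += 1
    return hits / trials


def ask_crowd(ask_worker, options, confidence=0.95, start=3, max_votes=30):
    """Call ask_worker() (a stand-in for posting one paid task) until the
    leading answer's support is unlikely under random guessing.

    Returns (answer, votes_used); answer is None if max_votes is reached
    without sufficient agreement."""
    votes = [ask_worker() for _ in range(start)]
    while True:
        answer, top = Counter(votes).most_common(1)[0]
        p = chance_of_agreement(len(votes), len(options), top)
        if p <= 1 - confidence:
            return answer, len(votes)
        if len(votes) >= max_votes:
            return None, len(votes)
        votes.append(ask_worker())  # not confident yet: pay for one more answer


if __name__ == "__main__":
    # A perfectly consistent (simulated) worker pool answering a 4-option question.
    print(ask_crowd(lambda: "b", ["a", "b", "c", "d"]))
```

With four options, three unanimous answers would occur by chance about 1 time in 16, which is not enough for 95% confidence, so the sketch asks a fourth worker before accepting. The caller only ever sees an ordinary function call that returns an answer.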
We're working on a lot of the same things at Scale API (www.scaleapi.com). We start with a higher-quality set of task completers and build in similar statistical guarantees for our tasks.
One of the things we work on is building quality for responses that are a little more complex (bounding boxes and audio transcription, for example). I'd be interested to see if we can apply some of your learnings to those task types!
http://automan-lang.com https://github.com/plasma-umass/AutoMan
Paper here on AutoMan, round one: http://cacm.acm.org/magazines/2016/6/202648-automan/abstract (CACM Research Highlight, 2015)
Original paper, not behind a paywall: https://people.cs.umass.edu/~emery/pubs/res0007-barowy.pdf (OOPSLA '12)
New features described here have been rolled into AutoMan: "VoxPL: Programming with the Wisdom of the Crowd" (CHI '17, to appear): https://people.cs.umass.edu/~emery/pubs/voxpl-chi.pdf