That's a very well-researched comment! There are a few things probably worth noting. Roth, Bobko and McFarland have been pretty active on this topic for the past decade. They've found that the validity coefficient cited by Schmidt and Hunter in 1998 is likely an overestimate, because it relied on research conducted when statistical and methodological best practices were less rigorous.

[1]http://www.psychologie.uni-mannheim.de/cip/Tut/seminare_witt...

The validity coefficient provided by Roth and Bobko is likely more accurate. That isn't to diminish work samples, which are still valuable, but they aren't the cure-all we'd like them to be. They do continue to show promise in reducing adverse impact though, which is great (note: the full article is behind a paywall - what is the HN-approved method of sharing the information?):

[2]http://onlinelibrary.wiley.com/doi/10.1111/j.1468-2389.2010....

That is for gender. With regard to ethnicity, the evidence isn't quite as optimistic yet. As with other predictors, including cognitive ability tests, if they are showing notable adverse impact you may be in trouble despite their validity.

[3]http://onlinelibrary.wiley.com/doi/10.1111/j.1744-6570.2008....
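
For anyone who hasn't run into the term: adverse impact is commonly screened with the "four-fifths rule" from the US Uniform Guidelines, which compares selection rates between groups. Here is a minimal sketch of that check. The function name and the applicant counts are made up for illustration and don't come from any of the linked papers (which mostly report standardized subgroup differences instead).

```python
def adverse_impact_ratio(selected_focal, applicants_focal,
                         selected_reference, applicants_reference):
    """Ratio of the focal group's selection rate to the reference group's.

    Under the four-fifths rule of thumb, a ratio below 0.8 is generally
    treated as evidence of adverse impact.
    """
    rate_focal = selected_focal / applicants_focal
    rate_reference = selected_reference / applicants_reference
    return rate_focal / rate_reference


# Hypothetical numbers: 30 of 100 focal-group applicants pass,
# versus 50 of 100 reference-group applicants.
ratio = adverse_impact_ratio(30, 100, 50, 100)
flagged = ratio < 0.8  # True here, since the ratio is about 0.6
```

The point is that this check says nothing about validity: a test can predict performance well and still get flagged.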

It's a problem with lots of predictors, though the scope of the problem varies. There is work being done all the time, even on the most reliably stalwart predictor, the cognitive ability test:

[4]http://psycnet.apa.org/journals/apl/92/3/794/

Anyone interested in the great "diversity-validity dilemma" can check out this link for more information, though there's always progress. It's a great article.

[5]http://onlinelibrary.wiley.com/doi/10.1111/j.1744-6570.2008....

For my money, I endorse integrity tests as part of the solution. They have decent validity, including incremental validity over cognitive ability due to a low correlation between the two, and small subgroup differences.

[6]http://onlinelibrary.wiley.com/doi/10.1111/j.1744-6570.2007....
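
To show why a low correlation between two predictors matters for incremental validity, here is a rough sketch using the standard two-predictor multiple correlation formula. The validity and intercorrelation values below are hypothetical numbers I picked to make the arithmetic clean; they are not the estimates from the linked articles.

```python
import math


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: validity of each predictor against the criterion.
    r_12: correlation between the two predictors.
    """
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)


def incremental_validity(r_y1, r_y2, r_12):
    """Gain in multiple R from adding predictor 2 on top of predictor 1."""
    return multiple_r(r_y1, r_y2, r_12) - r_y1


# Hypothetical validities: predictor 1 at 0.5, predictor 2 at 0.3.
gain_uncorrelated = incremental_validity(0.5, 0.3, 0.0)  # about 0.08
gain_overlapping = incremental_validity(0.5, 0.3, 0.6)   # about 0.00
```

With these made-up numbers, the second predictor adds roughly 0.08 to the multiple R when the two predictors are uncorrelated, and essentially nothing when they correlate at 0.6. That is the sense in which a low correlation with cognitive ability is what makes an add-on predictor worth having.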

Having said all that, I imagine the efficacy of work samples is moderated by the type of work, and I'd have to believe they are more amenable to demonstrations of technical skill like coding (I don't know of any references for this now, but I'll look later). Coding-related jobs would be nice because it would be possible to blindly judge the output as well, and in programming-type jobs it would be much easier and more cost-efficient to test large numbers of applicants than it would be for many other jobs. Cost and ease of large-scale administration are the big problems with work samples, so overcoming those would be gravy. I don't know how subgroup differences are impacted though.