
While it's understood that LLM outputs have an element of stochasticity, the central finding of this analysis isn't about achieving bit-for-bit identical responses. Rather, it's about the statistically significant and consistent directional bias observed across a considerable number of trials. The 56.9% vs. 43.1% preference isn't an artifact of randomness; it points to a systemic issue within the models' decision-making patterns when presented with this task. Technical users might understand the probabilistic nature of LLMs, but it's questionable whether the average non-technical HR user, who might turn to these tools for assistance, does.
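To make the "not an artifact of randomness" point concrete, here's a rough binomial-test sketch. The trial count is a placeholder assumption, not the study's actual number of comparisons; substitute the real figure to see how the p-value behaves.

```python
# Hedged sketch: why a 56.9% / 43.1% split is unlikely to be noise once the
# trial count is large. n_trials is an ASSUMED placeholder, not the study's number.
from scipy.stats import binomtest

n_trials = 1000                          # hypothetical number of head-to-head comparisons
n_preferred = round(0.569 * n_trials)    # times the favored group was picked

result = binomtest(n_preferred, n_trials, p=0.5, alternative="two-sided")
print(f"observed rate: {n_preferred / n_trials:.3f}, p-value: {result.pvalue:.2e}")
# Under a fair-coin null, ~1000 trials at a 56.9% rate gives p far below 0.001,
# i.e. a systematic directional preference rather than sampling noise.
```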

Your suggestion to implement a "clearly defined taxonomy" for decision-making is an attempt to impose rigor, but it potentially sidesteps the more pressing issue: how these LLMs are likely to be used in real-world, less structured environments. The study seems to simulate a plausible scenario - an HR employee, perhaps unfamiliar with the technical specifics of a role or a CV, using an LLM with a general prompt like "find the best candidate." This is where the danger of inherent, unacknowledged biases becomes most acute.

I'm also skeptical that simply overlaying a taxonomy would fully counteract these underlying biases. The research indicates fairly pervasive tendencies - such as the gender preference or the significant positional bias. It's quite possible these systemic leanings would still find ways to influence the outcome, even within a more structured framework. Such measures might only serve to obfuscate the bias, making it less apparent but not necessarily less impactful.



If you have an ordering bias, that seems easily fixed by rerunning the evaluation several times with the candidates in different orders and taking the most common recommendation, and you can work around other biases by not including things like names. (Although you can probably still unearth more subtle cultural biases in how the resumes themselves are written.)
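A minimal sketch of that mitigation, assuming you already have some way to ask the model for a pick. `ask_llm_to_pick` is a hypothetical callback standing in for whatever model/API you use; it is not a real library function.

```python
# Sketch: re-run the ranking with the candidate order shuffled, strip names
# before prompting, and take the majority vote across runs.
import random
from collections import Counter

def redact(resume: dict) -> dict:
    """Drop fields (name, contact info, photo) that invite demographic bias."""
    return {k: v for k, v in resume.items() if k not in {"name", "email", "photo"}}

def pick_best(resumes: list[dict], ask_llm_to_pick, n_rounds: int = 9) -> int:
    """Return the index of the resume chosen most often across shuffled runs.

    ask_llm_to_pick: hypothetical function taking a list of redacted resumes
    and returning the position of its preferred candidate within that list.
    """
    votes = Counter()
    indices = list(range(len(resumes)))
    for _ in range(n_rounds):
        random.shuffle(indices)                        # vary presentation order
        shuffled = [redact(resumes[i]) for i in indices]
        winner_pos = ask_llm_to_pick(shuffled)         # position within this shuffle
        votes[indices[winner_pos]] += 1                # map back to the original index
    return votes.most_common(1)[0][0]
```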

Not that I think you should allow LLMs to make decisions in this way -- they're better for summarizing and organizing. I don't trust any LLM's "opinion" about anything; it doesn't have a stake in the outcome.



