ChatGPT and Copilot both frequently hallucinate entire plausible-sounding APIs, properties, or methods that simply don't exist, and tracking those down often eats a lot of debugging time. They also routinely omit proper input validation, error handling, and logging, even when prompted to include them.
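To make the second complaint concrete, here's a minimal sketch of what I mean by "proper input validation, error handling, and logging" (the load_config name and the JSON-config scenario are my own hypothetical example, not from any actual generated output). The checks and except blocks below are exactly the parts that tend to be missing:

```python
import json
import logging

logger = logging.getLogger(__name__)

def load_config(path: str) -> dict:
    """Load a JSON config file with the validation, error handling,
    and logging that generated code usually leaves out."""
    # Input validation: reject obviously bad arguments up front.
    if not isinstance(path, str) or not path:
        raise ValueError("path must be a non-empty string")

    # Error handling: distinguish "file missing" from "file is garbage",
    # and log each case before re-raising.
    try:
        with open(path, encoding="utf-8") as fh:
            config = json.load(fh)
    except FileNotFoundError:
        logger.error("config file not found: %s", path)
        raise
    except json.JSONDecodeError as exc:
        logger.error("invalid JSON in %s: %s", path, exc)
        raise

    # Validate the shape of the parsed data, not just that parsing succeeded.
    if not isinstance(config, dict):
        raise TypeError(f"expected a JSON object in {path}, got {type(config).__name__}")

    logger.info("loaded config from %s", path)
    return config
```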
I want to do a structured efficiency study of programming tasks, human vs. human-plus-AI, from problem statement through production-ready code. But my org doesn't have enough devs to reach statistical significance, nor the spare human capacity to invest in duplicative tasks. I assume there must be some studies out there; does anyone have a reference?