I'm confidently relaying my experience. But I get that I was extremely terse and overly general in my reply.
I haven't surveyed all the papers, although I have read some. All the ones I've seen that work okay do so by using a language graph or word association graph in their algorithm, not just embeddings. Even then, the results don't look good to me compared to human performance.
Why does it sound crazy that it wouldn't work well? Have you used word embeddings much? Maybe you have, and have good reason to think this; I don't mean to imply otherwise. But it doesn't sound crazy to me that it wouldn't work well.
It’s odd to me that you would confidently claim it’s “beyond awful”.