They're consistent as far as the model is concerned, particularly if you ask it to rationalize its rating. You'll get plenty of hallucinated answers that the model can recognize as hallucinations and rate low in the same response.
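Concretely, the pattern is just asking for the answer and the rating in one go. A minimal sketch (Python, assuming the openai client; the prompt wording, the 1-10 scale, and the model name are illustrative choices, not anything canonical):

```python
# Sketch of the "rate and rationalize your own answer" pattern.
# Assumes the openai Python client; model name and prompt wording
# are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

def answer_and_self_rate(question: str) -> str:
    prompt = (
        "Answer the question below. Then, on a new line starting with "
        "'Rating:', rate from 1-10 how confident you are that the answer "
        "is factually correct, and briefly explain the rating.\n\n"
        f"Question: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

The answer itself may still be hallucinated, but the rating and rationale that follow it often flag that it's shaky.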
Models can get caught out by what they start to say early. If the model goes down a path early on that looks like a likely answer but turns out to be a false lead or dead end, it will make up something plausible-sounding to finish that line of thought, even if it's wrong. This is why chain of thought and other "pre-answer" techniques improve results.
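A "pre-answer" prompt just moves the reasoning ahead of the commitment, so the model isn't locked into whatever it blurted out first. Rough sketch, same assumptions as above:

```python
# Sketch of a chain-of-thought / "reason before you answer" prompt.
# Same assumptions: openai client, illustrative model name and wording.
from openai import OpenAI

client = OpenAI()

def answer_with_reasoning(question: str) -> str:
    prompt = (
        "Think through the question step by step first. Only after your "
        "reasoning, write a final line starting with 'Answer:' that "
        "commits to an answer (or says you are unsure).\n\n"
        f"Question: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```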
Because of the way transformers work, they have very good hindsight: they can recognize that something they've just said is incorrect far more reliably than they can avoid saying it in the first place.
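That asymmetry is also why a simple second pass over the model's own output tends to catch mistakes the first pass let through. A sketch of that generate-then-review loop, under the same assumptions as the earlier snippets:

```python
# Sketch of a generate-then-review loop that exploits the model's
# hindsight: it reads back its own draft and flags or fixes errors.
# Same assumptions: openai client, illustrative model name and prompts.
from openai import OpenAI

client = OpenAI()

def _chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def answer_with_review(question: str) -> str:
    draft = _chat(f"Answer the question:\n\n{question}")
    return _chat(
        "Here is a question and a draft answer. In hindsight, point out "
        "anything in the draft that is incorrect or made up, then give a "
        f"corrected answer.\n\nQuestion: {question}\n\nDraft: {draft}"
    )
```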