To be sure, a lot of this can be blamed on using AI studio to ask a small model a factual question. It's the raw LLM output of a highly compressed model, it's not meant to be everyday user facing like the default Gemini models, and doesn't have the same web search and fact checking behind the scenes.
On the other hand, training a small model to hallucinate less would be a significant development. Perhaps with post-training fine-tuning, after getting a sense of what depth of factual knowledge the model has actually absorbed, adding a chunk of training samples with a question that goes beyond the model's fact knowledge limitations, and the model responding "Sorry, I'm a small language model and that question is out of my depth." I know we all hate refusals but surely there's room to improve them.
All of these techniques just push the problems around so far. And anything short of 100% accurate is a 100% failure in any single problematic instance.
On the other hand, training a small model to hallucinate less would be a significant development. Perhaps with post-training fine-tuning, after getting a sense of what depth of factual knowledge the model has actually absorbed, adding a chunk of training samples with a question that goes beyond the model's fact knowledge limitations, and the model responding "Sorry, I'm a small language model and that question is out of my depth." I know we all hate refusals but surely there's room to improve them.