Interesting. The current DB dump is not for everyone, but if they also offer an LLM trained on Wikipedia data that answers questions and provides actual valid citations, please do. (Not sure if DuckDuckGo stopped offering that.)
AFAIK a good way to provide better answers and avoid hallucinations would be to compute embeddings for all sections of text in Wikipedia, and then, when a user asks a question, create an embedding from that question.
Use it to find the X closest embeddings to the question being posed, look up their original articles, feed them all into the context of an LLM, and then ask it to answer the question based on that context (alone).
Contexts are becoming quite large, so it's possible to put a lot of stuff in there. LLMs answering questions based on a given text seem to be more reliable than those that are simply trained/fine-tuned on some library of texts.
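A minimal sketch of what that pipeline might look like, assuming the sentence-transformers library for the embeddings; the model name is just an example and ask_llm() is a hypothetical placeholder for whatever LLM API you'd actually call:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1) Offline: embed every section of text once, keeping the index -> text mapping
sections = ["...section text...", "...another section..."]  # e.g. Wikipedia sections
section_vecs = model.encode(sections, normalize_embeddings=True)

def answer(question: str, top_k: int = 5) -> str:
    # 2) Embed the question and find the X closest sections
    #    (cosine similarity is just a dot product on normalized vectors)
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = section_vecs @ q_vec
    best = np.argsort(scores)[::-1][:top_k]

    # 3) Feed the retrieved text into the LLM's context and
    #    constrain the answer to that context alone
    context = "\n\n".join(sections[i] for i in best)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)  # hypothetical LLM call
```

At Wikipedia scale you'd swap the brute-force dot product for a vector index, but the shape of the approach is the same.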
The approach described above is what is commonly referred to as RAG[0]. I am not aware of anyone having used it on Wikipedia, but from experience, while it helps, it does not eliminate all hallucinations.
I attempted it as well, about a year ago (mostly for fun), for our project.
Yes, it can still hallucinate. But I would say it's much much much better in this regard than fine-tuning.
When I did it, the main issue was that our documentation wasn't exhaustive enough. There are plenty of things that are clear to our users (other teams in the company) but not at all clear to the LLM from the few text excerpts it receives. Also, our context back then was quite limited, just a few paragraphs of text.
You can do this with the Copilot chat feature in MS Edge. I just tried asking it to use only Wikipedia and it gave me four references, two of which were wiki. So at least you can get it to spit out references with a bias.