I have a question about RAG in general (I am quite ignorant about LLMs and I am trying to reason about possible hurdles before I start experimenting).
I would like to "train" an LLM with a few thousand analysis documents that detail the various enhancements applied to an in-house app over the last twenty years.
My question is this: some of the modules that are part of my app have been totally revamped, sometimes more than once.
So while the general requirements for module Foo are more or less consistent, the documents discussing it from 2005 to, say, 2018 describe either bug fixes or small enhancements.
In 2019 our main Foo provider completely changed their product, and therefore the interface, so the 2 docs about Foo from 2019 are "more authoritative" than anything before that date... but then COVID happened, so we now have Foo 3.0, which was implemented in late 2022 and is now only lightly maintained with, again, small enhancements and fixes.
Documents have IDs which include an always-increasing number (they start their life as Jira issues), so just saying "newer = more accurate/valid/authoritative" could help, but I hope we do not need to rank/tag/grade every single document manually in order to assess how much weight it has on any specific topic.
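To make the "newer = more authoritative" idea concrete, this is roughly what I imagine: take whatever the retriever returns and blend its similarity score with a recency score derived from the issue number. This is only a sketch of the idea, not tied to any real RAG library. The ID format (`APP-1234`), the function names, the 0.3 weight, and the sample scores are all made up for illustration.

```python
import re


def issue_number(doc_id: str) -> int:
    """Extract the trailing number from an ID such as 'APP-1234' (hypothetical format)."""
    match = re.search(r"(\d+)$", doc_id)
    if match is None:
        raise ValueError(f"no numeric suffix in {doc_id!r}")
    return int(match.group(1))


def rerank_by_recency(hits, recency_weight=0.3):
    """Re-order retrieved documents by a blend of similarity and recency.

    hits: list of (doc_id, similarity) pairs, similarity assumed to be in [0, 1].
    recency_weight: 0 keeps the pure similarity order, 1 orders by ID only.
    """
    if not hits:
        return []
    numbers = [issue_number(doc_id) for doc_id, _ in hits]
    lowest, highest = min(numbers), max(numbers)
    span = (highest - lowest) or 1  # avoid dividing by zero if all IDs match
    scored = []
    for (doc_id, similarity), number in zip(hits, numbers):
        recency = (number - lowest) / span  # 0.0 = oldest hit, 1.0 = newest hit
        blended = (1 - recency_weight) * similarity + recency_weight * recency
        scored.append((doc_id, blended))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up retrieval results for a query about Foo.
hits = [("APP-120", 0.82), ("APP-9150", 0.78), ("APP-15400", 0.75)]
ranked = rerank_by_recency(hits)
```

With these invented numbers the newest document ends up first even though its raw similarity is slightly lower. What I cannot judge is whether a blanket rule like this is enough, since an old document may still be the only one describing a requirement that never changed.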
Is this something that needs special treatment or will it just "work"?