"""
Task: Divide the provided text into semantically coherent chunks, each containing between 250 and 350 words. Aim to preserve logical and thematic continuity within each chunk, ensuring that sentences or ideas that belong together are not split across different chunks.
Guidelines:
1. Identify natural text breaks such as paragraph ends or section divides to initiate new chunks.
2. Estimate the word count as you include content in a chunk. Begin a new chunk when you reach approximately 250 words, preferring to end on a natural break close to this count, without exceeding 350 words.
3. In cases where text does not neatly fit within these constraints, prioritize maintaining the integrity of ideas and sentences over strict adherence to word limits.
4. Adjust the boundaries iteratively, refining your initial segmentation based on semantic coherence and word count guidelines.
Your primary goal is to minimize disruption to the logical flow of content across chunks, even if slight deviations from the word count range are necessary to achieve this.
"""
Might sound like a rookie question, but curious how you'd tackle semantic chunking for a hefty text, like a 100k-word book, especially with phi-2's 2048 token limit [0]. Found some hints about stretching this to 8k tokens [1] but still scratching my head on handling the whole book. And even if we get the 100k words in, how do we smartly chunk the output into manageable 250-350 word bits? Is there a cap on how much the output can handle? From what I've picked up, a neat summary ratio for a large text without missing the good parts is about 10%, which translates to around 7.5K words or over 20 chunks for the output. Appreciate any insights here, and apologies if this comes off as basic.
Wild speculation - do you think there could be any benefit from creating two sets of chunks with one set at a different offset from the first? So like, the boundary between chunks in the first set would be near the middle of a chunk in the second set?
No, it's better to just create summaries of all the chunks, and return summaries of chunks that are adjacent to chunks that are being retrieved. That gives you edge context without the duplication. Having 50% duplicated chunks is just going to burn context, or force you to do more pre-processing of your context.
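Roughly this shape, as a sketch (chunks and summaries are parallel lists you've built ahead of time, and hit is whatever index your retriever returns; the names are made up for illustration):

    # Return the retrieved chunk verbatim, plus summaries of its neighbours
    # for edge context, without duplicating 50% of every chunk.
    def assemble_context(hit, chunks, summaries):
        parts = []
        if hit > 0:
            parts.append("[previous chunk, summarized]\n" + summaries[hit - 1])
        parts.append(chunks[hit])
        if hit < len(chunks) - 1:
            parts.append("[next chunk, summarized]\n" + summaries[hit + 1])
        return "\n\n".join(parts)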
This just isn't working for me; phi-2 starts summarizing the document I'm giving it. I tried a few news articles and blog posts. Does using a GGUF version make a difference?
Depending on the number of bits in the quantization, for sure. The most common failure mode should be minor restatements which you can choose to ignore or not.
That looks like it'd be an adjunct strategy IMO. In most cases you want to have the original source material on tap; it helps with explainability and citations.
That being said, it seems that everyone working at the state of the art is thinking about using LLMs to summarize chunks, and summarize groups of chunks in a hierarchical manner. RAPTOR (https://arxiv.org/html/2401.18059v1) was just published and is close to SoTA, and from a quick read I can already think of several directions to improve it, and that's not to brag but more to say how fertile the field is.
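The general shape, if it helps (this is not RAPTOR itself, which also clusters chunks by embedding before summarizing; summarize here stands in for whatever LLM call you use):

    # Summarize chunks, then summarize groups of summaries, and keep every
    # level around so retrieval can hit leaves or any intermediate summary.
    def build_tree(chunks, summarize, group_size=5):
        levels = [list(chunks)]
        current = [summarize(c) for c in chunks]
        while len(current) > 1:
            levels.append(current)
            current = [
                summarize("\n\n".join(current[i:i + group_size]))
                for i in range(0, len(current), group_size)
            ]
        levels.append(current)  # the root summary
        return levels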
Whether or not it follows the instructions as written, it produces good output as long as the chunk size stays on the smaller side. You can easily validate that all of the original text is present in the chunks and that no additional text has been inserted, and automatically re-prompt when the check fails.
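Something like this is enough for the check, with ask_model standing in for your prompt-plus-parse step (a sketch, not a drop-in):

    import re

    def _normalize(s):
        return re.sub(r"\s+", " ", s).strip()

    # Ignoring whitespace, the chunks joined back together must reproduce the
    # original text exactly: nothing dropped, nothing inserted.
    def chunks_are_faithful(original, chunks):
        return _normalize(" ".join(chunks)) == _normalize(original)

    def chunk_with_retries(original, ask_model, max_tries=3):
        for _ in range(max_tries):
            chunks = ask_model(original)
            if chunks_are_faithful(original, chunks):
                return chunks
        raise RuntimeError("model kept altering the text; fall back to a rule-based splitter")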
""" Task: Divide the provided text into semantically coherent chunks, each containing between 250-350 words. Aim to preserve logical and thematic continuity within each chunk, ensuring that sentences or ideas that belong together are not split across different chunks.
Guidelines: 1. Identify natural text breaks such as paragraph ends or section divides to initiate new chunks. 2. Estimate the word count as you include content in a chunk. Begin a new chunk when you reach approximately 250 words, preferring to end on a natural break close to this count, without exceeding 350 words. 3. In cases where text does not neatly fit within these constraints, prioritize maintaining the integrity of ideas and sentences over strict adherence to word limits. 4. Adjust the boundaries iteratively, refining your initial segmentation based on semantic coherence and word count guidelines.
Your primary goal is to minimize disruption to the logical flow of content across chunks, even if slight deviations from the word count range are necessary to achieve this. """