Good catch. That's likely an artifact of the way I flatten the nested JSON from the comments API.

I originally did that to save on tokens, but modern models have much larger input windows, so I may not need to do that any more.
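
For illustration, the flattening is something like this sketch (the field names author, text and children are assumptions about a generic nested comments payload, not the exact API schema):

    def flatten_comments(comment, depth=0, out=None):
        """Walk a nested comment tree and emit one flat record per comment.
        Assumes each node looks roughly like
        {"author": ..., "text": ..., "children": [...]}."""
        if out is None:
            out = []
        out.append({
            "depth": depth,
            "author": comment.get("author"),
            "text": comment.get("text"),
        })
        for child in comment.get("children", []):
            flatten_comments(child, depth + 1, out)
        return out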



This is why I keep advocating that JSON should die, or at least no longer be used with LLMs. LLMs (and human brains) are simply not wired up for counting nested curly brackets across long spans of quoted text joined by colons and commas, and it is far too easy for humans to make mistakes when chunking JSON.

IMO, (Strict)YAML is a very good alternative; it has even been suggested to me by multiple LLMs when I asked them what they thought the best format for presenting conversations to an LLM would be. It is very easy to chunk simple YAML and present it to an LLM directly off the wire: you only need to repeat the indentation and names of all higher-level keys (properties) pertaining to the current chunk at the top of the chunk, then start a text block containing the remaining text in the chunk, and the LLM will happily take it from there:

    topic:
      subtopic:
        text: |
          Subtopic text for this chunk.
If you want to make sure that the LLM understands that it is dealing with chunks of a larger body of text, you can start and end the text blocks of the chunks with an ellipsis ('...').
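
To make the chunking concrete, here is a rough Python sketch (the function name chunk_yaml_text, the fixed-size character split and the two-space indent are my own illustrative choices, not a reference implementation):

    def chunk_yaml_text(keys, text, max_chars=2000):
        """Split text into chunks, repeating the parent key path at the
        top of every chunk and marking continuations with ellipses."""
        header = ""
        for depth, key in enumerate(keys):
            header += "  " * depth + key + ":\n"
        header += "  " * len(keys) + "text: |\n"
        indent = "  " * (len(keys) + 1)
        pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
        chunks = []
        for n, piece in enumerate(pieces):
            prefix = "..." if n > 0 else ""
            suffix = "..." if n < len(pieces) - 1 else ""
            marked = prefix + piece + suffix
            body = "\n".join(indent + line for line in marked.splitlines())
            chunks.append(header + body)
        return chunks

Calling chunk_yaml_text(["topic", "subtopic"], long_text) produces chunks shaped like the example above, each one restating the full key path before its slice of the text.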


LLMs (transformers) literally cannot balance parentheses. That's outside of their complexity class (TC0). You'd want a real UTM to count parentheses!


Would you elaborate on why counting braces is different from counting spaces to determine hierarchy? Or is it more about the repetition of higher-level keys in chunks (which could be done in JSON)?


Repetition of topics and subtopics is by far the most important part, reinforcing attention on the topic at hand even if the text in the chunk appears unrelated to the topic when viewed in isolation.

Keeping the indentation is also important because it is an implicit and repeated indication of the nesting level of the content that follows. LLMs have trouble with balancing nested parentheses (as the sibling comment to yours explains).

Dealing with text where indentation matters is easier for LLMs: they have been exposed to large amounts of it during training (Python code, bulleted lists), so they have learned to handle it quite well.
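
As a toy illustration, two consecutive chunks taken from the middle of a longer subtopic would each restate the full key path and bracket their text with ellipses:

    topic:
      subtopic:
        text: |
          ...last sentences of the first chunk...

    topic:
      subtopic:
        text: |
          ...first sentences of the second chunk...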



