Good catch. That's likely an artifact of the way I flatten the nested JSON from the comments API.

I originally did that to save on tokens, but modern models have much larger input windows, so I may not need to do that any more.
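
For illustration, the flattening is something like this sketch (the field names author, text and children are assumptions about a generic nested comments payload, not the exact API schema):

    def flatten_comments(comment, depth=0, out=None):
        """Walk a nested comment tree and emit one flat record per comment.
        Assumes each node looks roughly like
        {"author": ..., "text": ..., "children": [...]}."""
        if out is None:
            out = []
        out.append({
            "depth": depth,
            "author": comment.get("author"),
            "text": comment.get("text"),
        })
        for child in comment.get("children", []):
            flatten_comments(child, depth + 1, out)
        return out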



This is why I keep advocating that JSON should die, or at least no longer be used with LLMs. LLMs (and human brains) are simply not wired up for counting nested curly brackets across long spans of quoted text joined by colons and commas, and it is far too easy for humans to make mistakes when chunking JSON.

IMO, (Strict)YAML is a very good alternative; it has even been suggested to me by multiple LLMs when I asked them what they thought the best format for presenting conversations to an LLM would be. It is very easy to chunk simple YAML and present it to an LLM directly off the wire: you only need to repeat the indentation and names of all higher-level keys (properties) pertaining to the current chunk at the top of the chunk, then start a text block containing the remaining text in the chunk, and the LLM will happily take it from there:

    topic:
      subtopic:
        text: |
          Subtopic text for this chunk.
If you want to make sure that the LLM understands that it is dealing with chunks of a larger body of text, you can start and end the text blocks of the chunks with an ellipsis ('...').
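
To make the chunking concrete, here is a rough Python sketch (the function name chunk_yaml_text, the fixed-size character split and the two-space indent are my own illustrative choices, not a reference implementation):

    def chunk_yaml_text(keys, text, max_chars=2000):
        """Split text into chunks, repeating the parent key path at the
        top of every chunk and marking continuations with ellipses."""
        header = ""
        for depth, key in enumerate(keys):
            header += "  " * depth + key + ":\n"
        header += "  " * len(keys) + "text: |\n"
        indent = "  " * (len(keys) + 1)
        pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
        chunks = []
        for n, piece in enumerate(pieces):
            prefix = "..." if n > 0 else ""
            suffix = "..." if n < len(pieces) - 1 else ""
            marked = prefix + piece + suffix
            body = "\n".join(indent + line for line in marked.splitlines())
            chunks.append(header + body)
        return chunks

Calling chunk_yaml_text(["topic", "subtopic"], long_text) produces chunks shaped like the example above, each one restating the full key path before its slice of the text.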


LLMs (transformers) literally cannot balance parentheses. That's outside of their complexity class (TC0). You'd want a real UTM to count parentheses!


Would you elaborate on why counting braces is different from counting spaces to determine hierarchy? Or is it more about the repetition of higher-level keys in chunks (which could be done in JSON)?


Repetition of topics and subtopics is by far the most important part, reinforcing attention on the topic at hand even if the text in the chunk appears unrelated to the topic when viewed in isolation.

Keeping the indentation is also important because it is an implicit and repeated indication of the nesting level of the content that follows. LLMs have trouble with balancing nested parentheses (as the sibling comment to yours explains).

Dealing with text where indentation matters is easier for LLMs: they have been exposed to large amounts of it during training (Python code, bulleted lists), so they have learned to handle it quite well.
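
As a toy illustration, two consecutive chunks taken from the middle of a longer subtopic would each restate the full key path and bracket their text with ellipses:

    topic:
      subtopic:
        text: |
          ...last sentences of the first chunk...

    topic:
      subtopic:
        text: |
          ...first sentences of the second chunk...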



