Hacker News | roh26it's comments

Looks solid, going to try it out.

I'm going to be _that guy_ and just ask: is the functionality set similar to LlamaParse, or is this LlamaParse + an LLM?


This does a lot more… the optimization state is incredibly elaborate. And some of the derived document type generation is complex enough to be a standalone app in my opinion. It’s really a suite of tools for generating new types of documents from your original document without any additional input required from the user besides the document itself.


What are the trade-offs you've made to achieve this?


We focused mainly on the scheduling side of things. So we essentially prioritize prefills over decodes. In order to do this correctly, we had to monitor KV cache usage and whenever it's close to running out of memory, we schedule more decodes again.

So this means that you end up either having many decodes wait for prefills to complete or you end up scheduling decodes with prefills. Both scenarios result in slower decodes which is why we're seeing an increase in the ITL. This is the main tradeoff we've made.
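The policy described above can be sketched in a few lines (all names here are hypothetical, not their actual implementation): drain prefills first, and only admit decodes early when KV-cache usage crosses a high-water mark, so finished sequences free their cache blocks.

```python
from collections import deque

KV_HIGH_WATERMARK = 0.90  # fraction of KV-cache memory in use before we back off prefills

def schedule_step(prefill_q: deque, decode_q: deque, kv_usage: float, batch_slots: int):
    """Pick requests for the next engine step, preferring prefills.

    When KV-cache usage nears capacity, skip prefills entirely so that
    in-flight decodes can finish and release their cache blocks.
    """
    batch = []
    if kv_usage < KV_HIGH_WATERMARK:
        while prefill_q and len(batch) < batch_slots:
            batch.append(prefill_q.popleft())
    # fill remaining slots (or the whole batch, under memory pressure) with decodes
    while decode_q and len(batch) < batch_slots:
        batch.append(decode_q.popleft())
    return batch
```

Under normal KV pressure the batch is prefill-heavy; past the watermark it becomes decode-only, which is exactly the ITL trade-off described above.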


So, while time to first token is lower, throughput might also be lower in most cases?


Per-user throughput might be lower at the moment, yes. We're working on GPU-kernel-level optimizations now to fix that.

But across all users on our system, the throughput is better because doing more prefills or a large number of grouped decodes has better utilization of the GPU.

The idea is that this works for someone who wants to build a product that is consistent across users in terms of initial response but can trade-off some E2E latency. It ensures that no one is waiting for a long time before getting the first response.


I don’t really get it. Prefill saturates compute and decode saturates memory bandwidth. Why are you not doing mixed batch?


You're totally right and we are doing a mixed batch. What we changed was the priority of performing prefills over decodes.

When looking at a variety of workloads, we realized that prioritizing finishing a query (prioritizing decodes) led to underutilization of the GPU. There tended not to be enough concurrently running requests (because prefill wasn't prioritized) to meaningfully utilize the memory bandwidth with the available decodes. This led to a system that was unfortunately neither compute- nor memory-bound.

By running mixed batches that prioritize prefills we still compute some decode tokens in our spare capacity, but ensure compute is as saturated as possible. This additionally leads to a buildup of decodes, so that when we are primarily computing decode we're pushing our memory bandwidth as much as we can.

Of course, there are still plenty of improvements to be made on this front. Finding a dynamic balance between prefill and decode that pushes both memory bandwidth and compute to their limits is the goal from a scheduling perspective. A whole host of factors, such as the model architecture, input-token:output-token ratio, underlying hardware, and KV-cache allocation (and many more), play into the pressure placed on memory and compute, so there's definitely still exploration to be done!
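One way to picture a mixed batch under a per-step compute budget (a sketch, not their actual implementation): spend the token budget on prefills first, then top it up with pending decode requests, each of which costs one token per step.

```python
def build_mixed_batch(prefills, num_decodes, token_budget):
    """Compose one engine step, prefill-first.

    prefills: list of (request_id, prompt_len); each decode costs 1 token.
    Returns (scheduled_prefill_ids, number_of_decode_slots).
    """
    chosen, used = [], 0
    for req_id, prompt_len in prefills:
        if used + prompt_len > token_budget:
            break  # this prefill would blow the compute budget; stop here
        chosen.append(req_id)
        used += prompt_len
    # spare capacity goes to decodes, so idle FLOPs still emit decode tokens
    decode_slots = min(num_decodes, token_budget - used)
    return chosen, decode_slots
```

With many queued prefills the batch is compute-saturated; as prefills drain, the leftover budget is absorbed by the built-up decodes, which then stress memory bandwidth instead.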


The most interesting findings (at least from my perspective):

- OpenAI measured "honesty", which could become an interesting model eval others should adopt.
- Using instruction hierarchy to _avoid_ prompt injections.
- When pushed to pursue misaligned goals, it tried to disable oversight in 5% of tests!


I've been thinking about AI guardrails in production for almost 7 months now, and I'm convinced that using them to "control", "secure", or "block" LLM outputs is not the best use for them.

Using them to "guide", "accelerate", and "empower" your AI development process? Now we're talking.

Let me explain why I believe this shift in perspective could change how we build AI...

Most people see AI guardrails as training wheels. But what if we saw them as a nitrous boost for our AI hot rods instead?

When you're not worried about your AI going off the rails, you can push it to its limits. You can experiment with wilder ideas, test bolder hypotheses, and innovate at lightning speed.

Want to train on spicy data? Guardrails have your back. Thinking of an aggressive model architecture? Guardrails keep you in bounds. Controversial prompt? Guardrails ensure you don't cross the line.

Am I in the right direction, or am I just going off the rails?


Also wondering if this would become expensive very fast if it ends up using S3 with a large number of PUT calls


tune down the write flush frequency (i.e., batch more records per PUT) and it'd be very cheap
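In other words, buffer records in memory and flush them as one object per interval, so PUT count scales with time rather than with log volume. A rough stdlib sketch (the `put_object` callback stands in for a real S3 client; names are hypothetical):

```python
import time

class BufferedS3Writer:
    """Accumulate log records and flush them as a single object per interval."""

    def __init__(self, put_object, flush_interval_s=60.0, now=time.monotonic):
        self.put_object = put_object          # callable(key: str, body: bytes)
        self.flush_interval_s = flush_interval_s
        self.now = now                        # injectable clock, for testing
        self.buffer = []
        self.last_flush = now()

    def write(self, record: str):
        self.buffer.append(record)
        if self.now() - self.last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self):
        if self.buffer:
            body = "\n".join(self.buffer).encode()
            self.put_object(f"logs/{int(self.now())}.ndjson", body)
            self.buffer.clear()
        self.last_flush = self.now()
```

At a 60-second interval that's at most ~1,440 PUTs per day per writer, regardless of how many records flow through.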


thanks for the support!


We use a bunch of caching mechanisms on the LLM requests themselves and extend the same to guardrails now.

So there's 2 levels of cache - the LLM request itself might be cached (simple and semantic) and the guardrail response can be cached as well.

We use a mix of a distributed kv store and a vector DB to actually store the data
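Conceptually, the two levels look like this (a stdlib sketch; the production setup described above uses a distributed KV store and a vector DB, and `embed` would call a real embedding model):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class TwoLevelCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed        # callable(str) -> vector
        self.threshold = threshold
        self.exact = {}           # level 1: prompt -> response (simple cache)
        self.vectors = []         # level 2: (embedding, response) pairs (semantic cache)

    def get(self, prompt):
        if prompt in self.exact:
            return self.exact[prompt]
        qv = self.embed(prompt)
        best = max(self.vectors, key=lambda v: cosine(qv, v[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, prompt, response):
        self.exact[prompt] = response
        self.vectors.append((self.embed(prompt), response))
```

The same lookup structure works for guardrail verdicts: key on the guardrail input, cache the pass/fail result.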


For one of the most downloaded datasets on Hugging Face, I was a little surprised by how dirty it was. It also had very limited information and some incorrect classifications.

For an internal experiment on building a "Truthful Evaluator", we picked up this dataset and tried fine-tuning a model on these 8000 odd examples.

Realised that it needed:

1. Cleaning up
2. Some reclassification

But, most importantly - it lacked context data. It only had a link pointing to the source which was also absent for a few rows.

We scraped the internet for the link in each row, matched the page to the question, and narrowed it down to a small context to add to the main dataset.
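The matching step can be approximated with a similarity search over the scraped page: pick the sentence that best overlaps the question and keep a small window around it. A rough stdlib sketch (the fetching step and function name are my assumptions, not their pipeline):

```python
from difflib import SequenceMatcher

def best_context(question: str, page_text: str, window: int = 2) -> str:
    """Return the run of sentences most similar to the question."""
    sentences = [s.strip() for s in page_text.split(".") if s.strip()]

    def score(i: int) -> float:
        return SequenceMatcher(None, question.lower(), sentences[i].lower()).ratio()

    best = max(range(len(sentences)), key=score)
    lo, hi = max(0, best - window // 2), best + window
    return ". ".join(sentences[lo:hi]) + "."
```

A production version would use embedding similarity rather than character matching, but the narrow-to-a-window idea is the same.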

Releasing it publicly so that someone else may avoid the 2-3 days of pain of wrangling with this data.


Here's a mega guide on keeping costs low with LLMs - https://portkey.ai/blog/implementing-frugalgpt-smarter-llm-u...

tl;dr:

- Keep prompts short; combine prompts, or write more detailed prompts but send them to a smaller model
- Simple and semantic cache lookups
- Classify tasks and route to the best LLM using an AI gateway
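The classify-and-route idea in miniature (model names and the keyword classifier are placeholders, not Portkey's API; a real gateway would classify with a small model):

```python
def classify(prompt: str) -> str:
    """Toy task classifier: keyword heuristics stand in for a real model."""
    if any(k in prompt.lower() for k in ("prove", "derive", "architecture")):
        return "hard"
    return "easy"

ROUTES = {"easy": "small-cheap-model", "hard": "large-expensive-model"}

def route(prompt: str, cache: dict) -> tuple:
    """Return (model, cache_status); check the cache before spending tokens."""
    if prompt in cache:
        return ("cache", "hit")
    return (ROUTES[classify(prompt)], "miss")
```

Easy tasks never touch the expensive model, and repeated prompts never touch a model at all, which is where most of the savings come from.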

Portkey.ai could help with a lot of this


came across this guide earlier - valuable insights. thanks for sharing!


At Portkey, this is a problem we deal with quite a bit. It's also the reason Datadog and the traditional observability vendors didn't work for LLM use cases: they're not built to handle such large volumes of data.

We've done this through a careful combination of ClickHouse for fast retrieval of log items plus selective retrieval of full payloads from MinIO buckets.

Cost becomes a very big factor when managing, filtering and searching through TBs of data even for fairly small use cases.

One thing we lost in the process is full-text search over the request & response pairs and while we try to intelligently add metadata to requests to make searching easier, it isn't the complete experience yet. Still WIP as a problem statement to solve and maybe the last straw here. Any suggestions?
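The metadata-first pattern that partially substitutes for full-text search looks roughly like this (a sketch under the assumption that each log row carries a small metadata dict in the fast store, with full request/response payloads living in object storage):

```python
def search_logs(index, filters, fetch_blob):
    """Narrow candidates via the cheap metadata index, then fetch payloads.

    index: list of (blob_key, metadata dict), e.g. rows from ClickHouse.
    filters: metadata equality constraints.
    fetch_blob: callable(key) -> payload, e.g. a MinIO GET.
    """
    keys = [key for key, meta in index
            if all(meta.get(k) == v for k, v in filters.items())]
    return [fetch_blob(k) for k in keys]
```

The limitation is exactly the one described above: anything not captured in metadata at write time can't be searched without scanning the blobs themselves.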


ClickHouse has text and vector indexes, so that may be native, though we've never used them, and I find vector indexes tricky to scale with other DBs. Text search, or neither, may be enough in practice, since we mostly only care about searching on metadata dimensions like task.

We are thinking about sampled hot data for ops staff in otel DB+UIs, and long-term full data in S3/Clickhouse for custom tooling. It'd be cool if we could send Clickhouse historical otel sessions to grafana etc on demand, but likely a bridge too far...


I think you can (pretty) easily set this up with an otel collector and something that replays data from S3 - there's a native implementation that converts otel to clickhouse


Our scenario would be more like using Clickhouse / a dwh for session cohort/workflow filtering and then populating otel tools for viz goodies. Interestingly, to your point, the otel python exporter libs are pretty simple, so SQL results -> otel spans -> Grafana temp storage should be simple!

