To add on to this: I think it should be mentioned that Slack says they'll prevent data leakage across workspaces in their model, but they don't explain how they do this. They don't seem to go into any detail about their data safeguards or how they're excluding sensitive info from training. Textual is good for this purpose since it redacts PII, preventing it from being leaked by the trained model.
How do you handle proprietary data being leaked? Sure you can easily detect and redact names and phone numbers and addresses, but without significant context it seems difficult to detect whether "11 spices - mix with 2 cups of white flour ... 2/3 teaspoons of salt, 1/2 teaspoons of thyme [...]" is just a normal public recipe or a trade secret kept closely guarded for 70 years
Fair question, but you have to consider the realistic alternatives. For most of our customers, inaction isn't an option. The combination of NER models + synthesis LLMs actually handles these types of cases fairly well. I put your comment into our web app and this was the output:
How do you handle proprietary data being leaked? Sure you can easily detect and redact names and phone numbers and addresses, but without significant context it seems difficult to detect whether "17 spices - mix with 2lbs of white flour ... half teaspoon of salt, 1 tablespoon of thyme [...]" is just a normal public recipe or a trade secret kept closely guarded for 75 years.
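For anyone curious what that looks like mechanically, here's a minimal sketch of the general NER + synthesis idea (not Textual's actual pipeline): spaCy detects entities and quantities, and an LLM generates realistic but fictitious replacements. The model names and prompt are purely illustrative.

```python
# Minimal sketch of the NER + synthesis idea (not Textual's actual pipeline).
# Assumes spaCy for entity detection and an OpenAI chat model for synthesizing
# realistic replacement values; model names and prompt are illustrative.
import spacy
from openai import OpenAI

nlp = spacy.load("en_core_web_sm")  # small English NER model
client = OpenAI()                   # reads OPENAI_API_KEY from the environment

def synthesize_replacement(text: str, label: str) -> str:
    """Ask an LLM for a realistic but fake value of the same entity type."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Generate a realistic but fictitious replacement for the "
                       f"{label} value '{text}'. Reply with only the replacement.",
        }],
    )
    return resp.choices[0].message.content.strip()

def redact(text: str) -> str:
    """Replace detected entities (names, places, dates, quantities) with synthesized stand-ins."""
    doc = nlp(text)
    out = text
    # Replace right-to-left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "GPE", "DATE", "CARDINAL", "QUANTITY"}:
            out = out[:ent.start_char] + synthesize_replacement(ent.text, ent.label_) + out[ent.end_char:]
    return out

print(redact("11 spices - mix with 2 cups of white flour, kept secret for 70 years."))
```

The key point is that quantities and dates get swapped for plausible fakes rather than just masked, so the text stays useful for training while the original specifics are gone.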
I attended a talk last week by someone from Balsa Research about the Jones Act. Balsa Research is trying to get it repealed. Highly recommend checking them out.
Your best bet is probably to go through a doctor and get testing from a medical genome sequencing service that is covered under HIPAA. I am not 100% sure this is bulletproof, but it is probably better than going through a DTC company. Plus, most DTC companies like 23andMe use SNP genotyping arrays rather than the whole-genome sequencing many medical providers use.
No. The archive.is folks intentionally poison DNS results for certain resolvers. They have a vendetta against Cloudflare for not passing EDNS Client Subnet data (rough client location) along with DNS lookups.
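You can see the behavior yourself by comparing answers from different public resolvers. A quick sketch, assuming the dnspython package is installed (resolver IPs are Cloudflare's and Google's public resolvers):

```python
# Compare how archive.is resolves through different public resolvers
# (assumes the dnspython package is installed).
import dns.resolver

def lookup(domain: str, nameserver: str) -> list[str]:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    return [rr.to_text() for rr in resolver.resolve(domain, "A")]

print("via 1.1.1.1:", lookup("archive.is", "1.1.1.1"))  # Cloudflare
print("via 8.8.8.8:", lookup("archive.is", "8.8.8.8"))  # Google
```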
At my company, we developed an open source library to measure whether the context the model received is accurate. While it's not exactly what you're asking for, you could in theory use it to measure when an LLM deviates from the provided context, and then use that signal to tune the LLM so it doesn't always rely on the context.
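The library's actual API is different, but the rough idea of a context-deviation check looks something like this (hedged sketch, assuming an OpenAI model as the judge; the prompt and helper name are made up for illustration):

```python
# Rough sketch of an LLM-as-judge groundedness check (not the library's actual API):
# ask a judge model whether the answer is supported by the retrieved context.
from openai import OpenAI

client = OpenAI()

def is_grounded(answer: str, context: str, model: str = "gpt-4") -> bool:
    """Return True if the judge model thinks the answer is supported by the context."""
    prompt = (
        "Context:\n" + context + "\n\n"
        "Answer:\n" + answer + "\n\n"
        "Is every claim in the answer supported by the context? Reply YES or NO."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```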
I tried out the Assistants API and noticed similarly bad performance, but with a catch. Apparently if you combine all the files into a single text file, the performance is amazing. But if the content is spread across multiple files, the performance is pretty bad.
Here's the catch: I did my own analysis of the Assistants API earlier and discovered that the good performance holds ONLY if you combine everything into a single text file. If you try multiple files, it fails.
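If anyone wants to try the workaround, a rough sketch looks like this (the file names and helper are made up; it assumes the official OpenAI Python SDK for the upload):

```python
# Sketch of the workaround: concatenate the documents locally and upload a single
# combined text file for retrieval (assumes the official OpenAI Python SDK).
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def combine_and_upload(paths: list[str], combined_name: str = "combined.txt"):
    # Merge every source document into one text file, separated by markers.
    combined = Path(combined_name)
    combined.write_text(
        "\n\n".join(f"=== {p} ===\n{Path(p).read_text()}" for p in paths)
    )
    # Upload the single combined file for use with the Assistants API.
    return client.files.create(file=combined.open("rb"), purpose="assistants")

# Hypothetical file names, purely for illustration.
uploaded = combine_and_upload(["handbook.md", "faq.md", "policies.md"])
print(uploaded.id)
```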
Pretty cool tutorial. As a side note, it is pretty hard to evaluate these pipelines for quality once you build them, since there aren't many standard practices yet given how new this all is. If it's helpful to anyone else, at my company we built a free, open source tool that is basically a collection of premade metrics for determining the quality of these pipelines. https://github.com/TonicAI/tvalmetrics
This is really useful! Using LLM-assisted evaluation seems like the way to go for evaluating RAG applications. One issue I've faced while evaluating responses using GPT-4 is that the evaluation cost can get out of hand rather quickly. Do you have any measures in place or ideas on how to handle this?
Unfortunately, right now the LLM cost is just a fundamental issue. I think it is hard to get around because comparing answer quality usually involves understanding the question and answer itself, which is a task that LLMs are really well suited to.
One thing we have considered is that some forms of evaluation could be replaced by comparing the embeddings of the question, context, and answer instead of running an LLM for the analysis. The idea is that comparing the embeddings by similarity gives a rough picture of performance, which should, in theory, reduce costs. The only other alternative is to use less advanced models, which are cheaper.
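For what it's worth, a rough sketch of what that embedding-based check could look like (assuming OpenAI embeddings; the score names are just illustrative):

```python
# Rough sketch of the cheaper embedding-based check: score an answer by its cosine
# similarity to the question and to the retrieved context (assumes OpenAI embeddings).
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rough_scores(question: str, context: str, answer: str) -> dict:
    q, c, a = embed(question), embed(context), embed(answer)
    return {
        "answer_vs_question": cosine(a, q),  # does the answer address the question?
        "answer_vs_context": cosine(a, c),   # does the answer stay close to the context?
    }
```

It's much coarser than an LLM judge, but embedding calls are orders of magnitude cheaper, so it can work as a first-pass filter.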
Disclaimer: I work at Tonic