Ephil012's comments

To add on to this: it's worth noting that Slack says they'll prevent data leakage across workspaces in their model, but they don't explain how. They don't seem to go into any detail about their data safeguards or how they exclude sensitive info from training. Textual is good for this purpose since it redacts PII, preventing it from ever being leaked by the trained model.

Disclaimer: I work at Tonic


How do you handle proprietary data being leaked? Sure, you can easily detect and redact names, phone numbers, and addresses, but without significant context it seems difficult to detect whether "11 spices - mix with 2 cups of white flour ... 2/3 teaspoons of salt, 1/2 teaspoons of thyme [...]" is just a normal public recipe or a trade secret kept closely guarded for 70 years.


Fair question, but you have to consider the realistic alternatives. For most of our customers, inaction isn't an option. The combination of NER models + synthesis LLMs actually handles these kinds of cases fairly well. I put your comment into our web app, and this was the output:

How do you handle proprietary data being leaked? Sure, you can easily detect and redact names, phone numbers, and addresses, but without significant context it seems difficult to detect whether "17 spices - mix with 2lbs of white flour ... half teaspoon of salt, 1 tablespoon of thyme [...]" is just a normal public recipe or a trade secret kept closely guarded for 75 years.
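
For anyone curious about the mechanics, here's a minimal sketch of the NER-redaction half of that pipeline. To be clear, this is not Textual's actual implementation; it just shows the general idea using spaCy's off-the-shelf NER, with the synthesis step reduced to dropping in type labels:

    import spacy

    # Off-the-shelf NER model; production systems use stronger, purpose-built models.
    nlp = spacy.load("en_core_web_sm")

    def redact(text: str) -> str:
        """Replace each detected entity with its entity-type label."""
        doc = nlp(text)
        # Splice in reverse so earlier character offsets stay valid.
        for ent in reversed(doc.ents):
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
        return text

    print(redact("Call John Smith at 555-0199 about the Louisville recipe."))
    # e.g. "Call [PERSON] at [CARDINAL] about the [GPE] recipe." (labels vary by model)

A synthesis LLM then swaps those placeholders for realistic fake values, as in the recipe example above, instead of leaving bracketed labels.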


I attended a talk last week by someone at Balsa Research about the Jones Act, which Balsa is trying to get repealed. Highly recommend checking them out.

https://www.balsaresearch.com/


Your best bet is probably to go through a doctor and get testing from a medical genome sequencing service that is covered under HIPAA. I'm not 100% sure that's bulletproof, but it's probably better than going through a DTC company. Plus, most DTC companies like 23andMe use SNP genotyping, which is far less comprehensive than the whole-genome sequencing many medical providers use.


So 23andMe isn't even the gold standard re: genomic analysis/testing? They're basically just the Dell of testing?


Dbrand Sues Casetify


I don't think the AP provided a link to the official campaign page. Here it is: https://www.restauracionecologica.org/adopciones


Out of curiosity, why the note about not using Cloudflare's DNS? Is using it incompatible with archive.is?


Yeah, archive.is doesn't like certain DNS resolvers.


Is it the other way around: certain DNS resolvers don't like archive.is?


No. The archive.is folks intentionally poison DNS results for certain resolvers. They have a vendetta against Cloudflare for not passing along client location data (EDNS Client Subnet) with DNS lookups.

https://jarv.is/notes/cloudflare-dns-archive-is-blocked/
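
If you want to check this yourself, here's a quick sketch using the dnspython library to compare what two public resolvers return for archive.is (the resolver IPs are the well-known public ones; actual answers vary over time):

    import dns.resolver  # pip install dnspython

    # Compare A records for archive.is across public resolvers.
    for name, server in [("Cloudflare", "1.1.1.1"), ("Google", "8.8.8.8")]:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server]
        answers = resolver.resolve("archive.is", "A")
        print(name, [rr.address for rr in answers])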


Wow, what a shame.


At my company, we developed an open source library to measure whether the context the model received is accurate. While not exactly what you're asking for, you could in theory use it to detect when an LLM deviates from the provided context and then tweak the model so it doesn't always rely on that context.

Shameless plug for the library: https://github.com/TonicAI/tvalmetrics
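
Under the hood it's LLM-assisted evaluation. The README covers the library's real interface; purely as an illustration of the idea (the prompt and helper below are hypothetical, not tvalmetrics' actual API), a judge-model call looks something like this:

    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    def context_faithfulness(question: str, context: str, answer: str) -> str:
        # Hypothetical helper: ask a judge model whether the answer
        # is supported by the retrieved context alone.
        prompt = (
            "Score from 0 to 10 how well the ANSWER is supported by the CONTEXT alone.\n"
            f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}\n"
            "Reply with only the number."
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content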


I tried out the Assistants API and noticed similarly bad performance, but with a catch: if you combine all the files into one single text file, the performance is amazing. If the same content is spread across multiple files, the performance is pretty bad.

Analysis here if anyone is curious: https://news.ycombinator.com/item?id=38280718


Here's the catch: I did my own analysis of the Assistants API earlier and discovered that this good performance holds ONLY if you combine everything into a single text file. If you try multiple files, it fails.

Here's my post with the analysis: https://news.ycombinator.com/item?id=38280718
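
If anyone wants to try the workaround, it amounts to concatenating the corpus before upload. A rough sketch with the OpenAI Python client (the file names are placeholders):

    from pathlib import Path
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    # Concatenate the corpus into one text file with separators, since
    # retrieval did much better on a single file than on many in my tests.
    docs = [Path("doc1.txt"), Path("doc2.txt"), Path("doc3.txt")]  # placeholders
    combined = "\n\n--- NEXT DOCUMENT ---\n\n".join(p.read_text() for p in docs)
    Path("combined.txt").write_text(combined)

    # Upload the single combined file for the assistant to retrieve from.
    uploaded = client.files.create(file=open("combined.txt", "rb"), purpose="assistants")
    print(uploaded.id)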


Pretty cool tutorial. As a side note, it's pretty hard to evaluate these pipelines for quality once you build them, since there aren't many standard practices yet given how new this all is. If it's helpful to anyone else, my company built a free open source tool that is basically a collection of premade metrics for determining the quality of these pipelines: https://github.com/TonicAI/tvalmetrics


This is really useful! Using LLM-assisted evaluation seems like the way to go for evaluating RAG applications. One issue I've faced while evaluating responses with GPT-4 is that the evaluation cost can get out of hand rather quickly. Do you have any measures in place, or ideas, on how to handle this?


Unfortunately, right now the LLM cost is just a fundamental issue. It's hard to get around because judging answer quality usually requires understanding the question and answer themselves, and that's a task really well suited to LLMs.

One thing we've considered is that some forms of evaluation could be done with the embeddings of the question, context, and answer instead of an LLM-based analysis. The idea is that comparing the embeddings gives a rough sense of performance based on similarity, which should reduce costs. The only other alternative is to use less advanced models, which are cheaper.
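
For concreteness, the embedding approach would look roughly like this: embed the question, context, and answer once each, then use cosine similarity as a cheap proxy score. The model name and example inputs below are just illustrative:

    import numpy as np
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    def embed(text: str) -> np.ndarray:
        resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
        return np.array(resp.data[0].embedding)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Illustrative inputs.
    question = "What temperature should the oven be?"
    context = "Preheat the oven to 425F before baking the pie."
    answer = "Set the oven to 425F."

    # Rough proxies: does the answer track the retrieved context and the question?
    print("answer vs context:", cosine(embed(answer), embed(context)))
    print("answer vs question:", cosine(embed(answer), embed(question)))

Each evaluation then costs a few embedding calls instead of a GPT-4 completion, which is far cheaper per token.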

