To add on to this: I think it should be mentioned that Slack says they'll prevent data leakage across workspaces in their model, but they don't explain how they do this. They don't seem to go into any detail about their data safeguards or how they're excluding sensitive info from training. Textual is good for this purpose since it redacts PII, preventing it from being leaked by the trained model.
How do you handle proprietary data being leaked? Sure you can easily detect and redact names and phone numbers and addresses, but without significant context it seems difficult to detect whether "11 spices - mix with 2 cups of white flour ... 2/3 teaspoons of salt, 1/2 teaspoons of thyme [...]" is just a normal public recipe or a trade secret kept closely guarded for 70 years
Fair question, but you have to consider the realistic alternatives. For most of our customers, inaction isn't an option. The combination of NER models + synthesis LLMs actually handles these types of cases fairly well. I put your comment into our web app and this was the output:
How do you handle proprietary data being leaked? Sure you can easily detect and redact names and phone numbers and addresses, but without significant context it seems difficult to detect whether "17 spices - mix with 2lbs of white flour ... half teaspoon of salt, 1 tablespoon of thyme [...]" is just a normal public recipe or a trade secret kept closely guarded for 75 years.
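For anyone curious what that looks like mechanically, here's a minimal sketch of the general NER + synthesis idea (not Textual's actual pipeline): spaCy detects entities and quantities, and an LLM generates realistic but fictitious replacements. The model names and prompt are purely illustrative.

```python
# Minimal sketch of the NER + synthesis idea (not Textual's actual pipeline).
# Assumes spaCy for entity detection and an OpenAI chat model for synthesizing
# realistic replacement values; model names and prompt are illustrative.
import spacy
from openai import OpenAI

nlp = spacy.load("en_core_web_sm")  # small English NER model
client = OpenAI()                   # reads OPENAI_API_KEY from the environment

def synthesize_replacement(text: str, label: str) -> str:
    """Ask an LLM for a realistic but fake value of the same entity type."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Generate a realistic but fictitious replacement for the "
                       f"{label} value '{text}'. Reply with only the replacement.",
        }],
    )
    return resp.choices[0].message.content.strip()

def redact(text: str) -> str:
    """Replace detected entities (names, places, dates, quantities) with synthesized stand-ins."""
    doc = nlp(text)
    out = text
    # Replace right-to-left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "GPE", "DATE", "CARDINAL", "QUANTITY"}:
            out = out[:ent.start_char] + synthesize_replacement(ent.text, ent.label_) + out[ent.end_char:]
    return out

print(redact("11 spices - mix with 2 cups of white flour, kept secret for 70 years."))
```

The key point is that quantities and dates get swapped for plausible fakes rather than just masked, so the text stays useful for training while the original specifics are gone.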
I attended a talk last week by someone from Balsa Research about the Jones Act. Balsa Research is trying to get it repealed. Highly recommend checking them out.
Your best bet is probably to go through a doctor and get testing from a medical genome sequencing service that is covered under HIPAA. I am not 100% sure this is bulletproof, but it is probably better than going through a DTC company. Plus, most DTC companies like 23andMe use SNP genotyping arrays rather than the whole-genome sequencing many medical providers use.
No. The archive.is folks intentionally poison DNS results for certain resolvers. They have a vendetta against Cloudflare for not passing EDNS Client Subnet data (rough client location) along with DNS lookups.
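You can see the behavior yourself by comparing answers from different public resolvers. A quick sketch, assuming the dnspython package is installed (resolver IPs are Cloudflare's and Google's public resolvers):

```python
# Compare how archive.is resolves through different public resolvers
# (assumes the dnspython package is installed).
import dns.resolver

def lookup(domain: str, nameserver: str) -> list[str]:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    return [rr.to_text() for rr in resolver.resolve(domain, "A")]

print("via 1.1.1.1:", lookup("archive.is", "1.1.1.1"))  # Cloudflare
print("via 8.8.8.8:", lookup("archive.is", "8.8.8.8"))  # Google
```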
At my company, we developed an open source library to measure whether the context the model received is accurate. While it's not exactly what you're asking for, you could in theory use it to measure when an LLM deviates from the provided context, and then use that signal to tune the LLM so it doesn't always rely on the context.
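The library's actual API is different, but the rough idea of a context-deviation check looks something like this (hedged sketch, assuming an OpenAI model as the judge; the prompt and helper name are made up for illustration):

```python
# Rough sketch of an LLM-as-judge groundedness check (not the library's actual API):
# ask a judge model whether the answer is supported by the retrieved context.
from openai import OpenAI

client = OpenAI()

def is_grounded(answer: str, context: str, model: str = "gpt-4") -> bool:
    """Return True if the judge model thinks the answer is supported by the context."""
    prompt = (
        "Context:\n" + context + "\n\n"
        "Answer:\n" + answer + "\n\n"
        "Is every claim in the answer supported by the context? Reply YES or NO."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```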
I tried out the Assistants API and noticed similarly bad performance, but with a catch. Apparently if you combine all the files into a single text file, the performance is amazing. But if the content is spread across multiple files, the performance is pretty bad.
Here's the catch: I did my own analysis of the Assistants API earlier and discovered that the good performance holds ONLY if you combine everything into a single text file. If you try multiple files, it fails.
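If anyone wants to try the workaround, a rough sketch looks like this (the file names and helper are made up; it assumes the official OpenAI Python SDK for the upload):

```python
# Sketch of the workaround: concatenate the documents locally and upload a single
# combined text file for retrieval (assumes the official OpenAI Python SDK).
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def combine_and_upload(paths: list[str], combined_name: str = "combined.txt"):
    # Merge every source document into one text file, separated by markers.
    combined = Path(combined_name)
    combined.write_text(
        "\n\n".join(f"=== {p} ===\n{Path(p).read_text()}" for p in paths)
    )
    # Upload the single combined file for use with the Assistants API.
    return client.files.create(file=combined.open("rb"), purpose="assistants")

# Hypothetical file names, purely for illustration.
uploaded = combine_and_upload(["handbook.md", "faq.md", "policies.md"])
print(uploaded.id)
```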
Pretty cool tutorial. As a side note, it is pretty hard to evaluate these pipelines for quality once you build them, since there aren't many standard practices yet given how new this all is. If it's helpful to anyone else, at my company we built a free, open source tool that is basically a collection of premade metrics for determining the quality of these pipelines. https://github.com/TonicAI/tvalmetrics
This is really useful! Using LLM-assisted evaluation seems like the way to go for evaluating RAG applications. One issue I've faced while evaluating responses using GPT-4 is that the evaluation cost can get out of hand rather quickly. Do you have any measures in place or ideas on how to handle this?
Unfortunately, right now the LLM cost is just a fundamental issue. I think it is hard to get around because comparing answer quality usually involves understanding the question and answer itself, which is a task that LLMs are really well suited to.
One thing we have considered is that some forms of evaluation could be replaced by comparing the embeddings of the question, context, and answer instead of running an LLM for the analysis. The idea is that comparing the embeddings by similarity gives a rough picture of performance, which should, in theory, reduce costs. The only other alternative is to use less advanced models, which are cheaper.
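For what it's worth, a rough sketch of what that embedding-based check could look like (assuming OpenAI embeddings; the score names are just illustrative):

```python
# Rough sketch of the cheaper embedding-based check: score an answer by its cosine
# similarity to the question and to the retrieved context (assumes OpenAI embeddings).
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rough_scores(question: str, context: str, answer: str) -> dict:
    q, c, a = embed(question), embed(context), embed(answer)
    return {
        "answer_vs_question": cosine(a, q),  # does the answer address the question?
        "answer_vs_context": cosine(a, c),   # does the answer stay close to the context?
    }
```

It's much coarser than an LLM judge, but embedding calls are orders of magnitude cheaper, so it can work as a first-pass filter.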
Disclaimer: I work at Tonic