
What I'd like to do is create a website where:

1. There is a list of open source fine-tuning datasets on millions of topics. Like, anime, lord of the rings, dnd, customer service responses, finance, code in many programming languages, children's books, religions, philosophies, etc. I mean, on every topic imaginable sort of like a Wikipedia or Reddit of fine-tuning data sets.

2. Users can select one or more available datasets as well as upload their own private datasets

3. Users can turn-key fine-tune llama 2 or other pre-trained models

Right now, doing this kind of thing is way beyond the capability of the common user.
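To make the idea concrete, here is a minimal sketch of what one record in such a shareable fine-tuning dataset might look like. Instruction/response pairs are a common convention (e.g. the Alpaca format), but the field names here — including the `topic` and `license` tags — are illustrative, not any existing standard:

```python
import json

# One hypothetical record in a community fine-tuning dataset.
record = {
    "instruction": "Summarize the plot of The Fellowship of the Ring.",
    "input": "",
    "output": "Frodo Baggins inherits the One Ring and sets out from the Shire...",
    "topic": "lord of the rings",  # hypothetical tag for browsing by topic
    "license": "CC-BY-4.0",        # open datasets would need clear licensing
}

# Datasets like this are usually shipped as JSONL: one JSON object per line.
line = json.dumps(record)
parsed = json.loads(line)
print(parsed["topic"])  # lord of the rings
```

Shipping as JSONL means users could mix and match datasets by simply concatenating files before fine-tuning.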



I personally don't see a future where common users will ever have to know the phrase "fine-tuning" or worry about it. The most I can see is "Do you consent to share your information with Apple/Meta/X/Microsoft/OpenAI's knowledge engine?" and if you agree, everything they have on you will power an extremely powerful all-encompassing knowledge engine. Probably with some daily recommendations to integrate a new domain into it, like, "We noticed you're into Lord of the Rings, so we went ahead and made your knowledge engine familiar with the collected works of Tolkien, all historical academic and modern interpretations and criticisms, transcripts of the movies, and generative AI fan fiction capabilities."


I don't think the major barrier to the idea would be consumer awareness. For the near term, the major barrier will be cost. Just as one example, together.ai offers a fine-tuning service at an advertised cost of $0.001 per 1k tokens used [1]. That will get pricey for even small datasets. No doubt this will come down, but I don't see consumers paying $1000 for a customized AI model that they then have to pay inference costs to run. Maybe once we get consumer devices with AI accelerators (e.g. Apple Neural Engine) capable of running sufficiently powerful LLMs, customers would be willing to customize and run locally.
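The arithmetic behind that cost claim can be sketched quickly. One detail worth noting: fine-tuning usually makes several passes (epochs) over the data, so billed tokens are a multiple of the dataset size. The epoch count and dataset sizes below are illustrative assumptions, not together.ai's actual billing model:

```python
# Rough cost estimate at the advertised $0.001 per 1k tokens [1].
PRICE_PER_1K_TOKENS = 0.001

def finetune_cost(dataset_tokens: int, epochs: int = 3) -> float:
    """Estimated dollars billed: tokens seen = dataset_tokens * epochs."""
    billed = dataset_tokens * epochs
    return billed / 1000 * PRICE_PER_1K_TOKENS

# A modest 10M-token dataset for 3 epochs:
print(round(finetune_cost(10_000_000), 2))   # 30.0
# A 300M-token dataset already gets into the hundreds of dollars:
print(round(finetune_cost(300_000_000), 2))  # 900.0
```

So per-consumer custom models at today's prices land in the tens to hundreds of dollars before any inference costs.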

The second point is, we don't yet know whether fine-tuned models, vector search, or ever-larger general-purpose LLMs are the right way to go.

But for business-to-business, I think this might be a viable business. If you had a whole bunch of ready-to-go open-source fine-tune datasets for commercial applications you might find a market of businesses that want to run their own models for a variety of reasons.

1. https://together.ai/pricing


The year is 2149. Previously, we thought time was the real commodity, water before that, and money even prior....

But now. Now. It's DNAX... cloning fraudulent DNA to make BIO-chips to unlock credits for "yee ol' goods an' services gub'nah"

Basically every transaction is bio-tracked, so if you want to go off grid you have to have false clones...

DNA from old embryos that allow you to build identities in their names and wear them like sleeves to navigate the systems.

This is how you manipulate the engines.


This sounds like a great fit for Cerebras, if they can set up the text database front end.

They could host the text database for free, and then offer a "oh look, you can train llama on this text right now for cheaper than a Nvidia box" button on every listing.

Then charge through the nose for private business training (kind of like they do now, but charging more).


I agree that it would be almost impossible to defend this kind of business, especially if you stayed committed to open-source datasets. It would come down to the UX and the community if you hoped to survive. Probably long-term you would either have to get into your own pre-trained models, fight the commodity hosting business or aim to get acquired.


Well, civitai is basically what you are describing. It's very doable.

But a big difference is that (for now) Stable Diffusion fine-tuning is much easier than LLaMA fine-tuning.


ELI5 - who exactly makes the open datasets you refer to? [SERIOUS Q]


This would initially be a community, like Wikipedia, Reddit, Github, etc. People who are passionate about the future of AI, believe in the value of open source data and want their voice to be part of a community of data that will be used to train AIs in the future.

In my wildest dreams (and even reasonably), you could incentivize people with a digital currency. I was thinking along the lines of a community member staking some money ($100/$1000) to get "ownership" of and moderating rights over a dataset. Other people could submit content to the dataset, which the moderator could accept or reject. Accepting content would distribute a share of the stake in the form of tokens. The moderator could then re-sell the data in the set to people who want to fine-tune AIs with it, and the value of the tokens tied to that dataset would go up, distributing a portion of the profit to the moderators and the contributors.
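A toy sketch of that staking mechanism might look like the following. All mechanics — the reward size, the moderator's cut, pro-rata payout on sale — are made up for illustration; nothing here is a real protocol:

```python
class Dataset:
    """Hypothetical community dataset with staked moderation and token rewards."""

    def __init__(self, moderator: str, stake: float):
        self.moderator = moderator
        self.stake = stake
        self.tokens = {moderator: 0.0}  # token balances per participant
        self.records = []

    def submit(self, contributor: str, record: str, accept: bool,
               reward: float = 1.0, moderator_cut: float = 0.2):
        """Moderator accepts or rejects; acceptance mints tokens for both sides."""
        if not accept:
            return
        self.records.append(record)
        self.tokens[contributor] = (self.tokens.get(contributor, 0.0)
                                    + reward * (1 - moderator_cut))
        self.tokens[self.moderator] += reward * moderator_cut

    def sell(self, price: float):
        """Distribute sale revenue pro rata to token holders."""
        total = sum(self.tokens.values())
        return {who: price * bal / total for who, bal in self.tokens.items()}

ds = Dataset(moderator="alice", stake=100.0)
ds.submit("bob", "Q/A pair about Rohan", accept=True)
ds.submit("carol", "Q/A pair about Gondor", accept=True)
payouts = ds.sell(price=10.0)
print(payouts["alice"])  # alice holds 0.4 of 2.0 tokens -> 2.0
```

The interesting design question is the moderator's cut: too high and contributors leave, too low and nobody stakes to curate a dataset in the first place.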



