I've been working on the idea of building synthetic workers. I'm trying to implement a planning workflow system for scenarios where the workflow definition, the environment, or the task is not well defined. I also ended up implementing a micro Palantir plugin system to support the action system for the synthetic users.
It's a cool project that gave me immense pleasure to build; unfortunately it's an intellectually masturbatory one, because although the tech is cool, I haven't found a compelling application for it. If anyone is interested, hit me up.
Location: Lisbon, Portugal
Remote: Yes
Willing to relocate: Maybe
Technologies: Django, Python, Vue, C, Java, JavaScript, Typescript, PyTorch, Nordic Firmware Development
Résumé/CV: https://www.surf-the-edge.com/wp-content/uploads/2024/01/Resume.pdf
Email: artur.ventura@gmail.com
Website: https://www.surf-the-edge.com/
My name is Artur Ventura. I'm a Senior Software Engineer with cross-functional expertise across every level of software development. I've been developing software professionally for 13 years, in scenarios with hundreds of thousands of users. I'm focused on Applied Artificial Intelligence, in particular Natural Language Processing pipelines. I'm just finishing the sale of a company and looking for a new challenge.
Interesting idea, but if you don't filter the elements that contain URLs, you're creating a CSS injection vector that lets visitors to the website be used as bots in a botnet to attack some URL.
This is really good, and I was really excited by it but then I read:
> running on a single 8XA100 40GB node in 38 hours of training
This is a $40-80k machine. Not a diss, but I would love to see an advance that would allow anyone with a high-end computer to improve on this model. Until that happens, this whole field is going to be owned by big corporations.
That's a great comparison. For a real number, I just checked Runpod and you can rent a system with 8xA100 for $17/hr, or ~$700 for 38 hours. Not cheap, but also pretty close to the cost of renting a premium vehicle for a few days. I've trained a few small models by renting a 1xA5000 system, and that only costs $0.44/hr, which is perfect for learning and experimentation.
The problem with that is that, currently, available memory scales with the class of GPU, and very large language models need 160-320GB of VRAM. So, sadly, there isn't anything out there you can load a model this large onto except a rack of 8+ A40s/A100s.
I know there are memory channel bandwidth limits and whatnot, but I really wish there was a card out there with a 3090-sized die but 96GB of VRAM, solely to make it easier to experiment with larger models. If it takes 8 days to train vs. 1, that's fine. Needing only two of them to get 192GB, while still fitting on a desk and drawing normal power, would be great.
Technically this is not true - there are a lot of techniques to shard models and store activations between layers, or even between smaller subcomponents of the network. For example, you can split the 175B-parameter BLOOM model into separate layers, load up a layer, read the previous layer's output from disk, and save this layer's output back to disk.
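As a rough illustration of that layer-by-layer approach, here's a minimal PyTorch sketch; the per-layer weight files and the stock nn.TransformerEncoderLayer are stand-ins for real model blocks, not BLOOM's actual layout.

```python
# Minimal sketch of layer-wise offloading: only one block's weights live in
# memory at a time, and activations are handed between steps via disk.
# The layer_{i}.pt files are assumed to have been saved beforehand.
import torch
import torch.nn as nn

NUM_LAYERS, D_MODEL = 8, 512

def build_layer():
    # Stand-in for a real transformer block of the sharded model.
    return nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)

hidden = torch.randn(1, 16, D_MODEL)  # output of the embedding step
with torch.no_grad():
    for i in range(NUM_LAYERS):
        layer = build_layer()
        layer.load_state_dict(torch.load(f"layer_{i}.pt", map_location="cpu"))
        hidden = layer(hidden)                # run just this block
        torch.save(hidden, "activations.pt")  # checkpoint for the next step
        del layer                             # free the weights before the next load
```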
And NVIDIA does make cards like the ones you're asking for - the A100 is the fast-memory offering, the A40 the bulk slower-memory one (though they added the 80GB A100 and did not double the A40 to 96GB, so this is less true now than in the P40 vs. P100 generation).
Oddly, you can get close to what you're asking for with an M1 Mac Studio - 128GB of decently fast memory with a GPU that is ~0.5x a 3090 in training.
Do you know if there's any work on peer-to-peer clustering of GPU resources over the internet? Imagine a few hundred people with 1-4 3080Tis each, running software that lets them form a cluster large enough to train and/or run a number of LLMs. Obviously the latency between shards would be orders of magnitude higher than a colocated cluster, but I wonder if that could be designed around?
Well, if it used to cost you $1 for 1hr at 1x speed, now it will take you 10hr at 0.1x speed and, if my math checks out, still $1. You need to shrink the model.
But of course now you run it on your own computer instead of in the DC, which changes the numbers. Especially if your student dorm has a shared electricity bill :)
Let's not forget that rendering 3D Animations in 3DSMAX or Maya used to take days for a single frame for a complex scene, and months for a few minutes.
Great news! Cloud instances' energy usage is included in their price, and because they're remote and transient, it's impossible to permanently damage them.
I think the equivalent of not being careful and getting a dent, in this context, is leaving it open to the internet and getting a bitcoin miner installed.
As you are paying for the resources you use, that's fine.
The closest would be if you used some form of software bug to actually cause physical damage, certainly not impossible, but extremely unlikely compared with actually physically damaging a car.
A better fit would be if you have unlimited liability, like with AWS, and you leak your key pair. Then someone runs up a $100k bill spinning up mining instances.
I think it was a DIY machine; those RTX 3090s have gotten cheaper, for sure.
In my experience, going beyond 4 GPUs is a pricey affair. See [§]. All but one model of the RTX 3090 require at least 3 slots.
If 4 GPUs connected via PCIe 4.0 x16 are enough, you can choose among various sTRX4 boards for 3000-series AMD Threadripper CPUs.
It's a $33/hour machine on AWS, so about $1250 for one training run. Not cheap, but easily in the reach of startups and educational or research institutions.
Edit: or about $340 if you get the 8xA100 instance from lambdalabs, in the realm of normal hobby spending
"...Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting...."
If you're doing something new/custom (which you presumably are if you aren't using someone else's prebuilt model), it could take a lot of runs to figure out the best training data and finetuning settings.
(I assume. I've never worked with GPT, but have done similar work in other domains).
Just download the model and run it on something much smaller and cheaper. Bigger models like GPT-J are a bit of a pain to run, but GPT2-sized models run just fine on consumer GPUs.
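For instance, a GPT-2-sized model runs on a single consumer GPU (or even CPU) with nothing more exotic than the Hugging Face transformers library; a minimal sketch:

```python
# Minimal sketch: generate text with stock GPT-2 on whatever hardware is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("The cheapest way to experiment with language models is",
                   return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```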
Ahh okay, thanks. So how big is the model? Seems like it should be available to download so people don't have to train it. I understand you can train it on custom data but for a "default" model are there any available to download?
Depends on precision: you can run a ~5B model at fp32, or a ~11B model at fp16 max. Int8 is really bad for real-world use cases, so I'm not mentioning it.
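Assuming a 24GB consumer card like a 3090 (my assumption; the parent doesn't name one), the arithmetic is just parameters times bytes per parameter:

```python
# Rough VRAM needed just to hold the weights (ignores activations and KV cache).
def weight_vram_gb(params_billion, bytes_per_param):
    return params_billion * bytes_per_param  # 1e9 params * N bytes ~= N GB per billion

print(weight_vram_gb(5, 4))    # ~20 GB at fp32 -> fits a 24 GB card
print(weight_vram_gb(11, 2))   # ~22 GB at fp16 -> barely fits
print(weight_vram_gb(175, 2))  # ~350 GB at fp16 -> needs a multi-GPU rack
```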
But if you are looking to get the performance of ChatGPT or GPT-3, don't waste your time: all small GPT-3-like LLMs (below at least 60B params) are useless for any real-world use case; they are just toys.
If you specifically mean a general LLM trained on a general language corpus with instruction finetuning this is correct.
Fortunately very few real world use cases need to be this general.
If you are training an LLM on a domain-specific corpus, or finetuning it on specific downstream tasks, even relatively tiny models at 330M params are definitely useful and not “toys”; they can accurately perform tasks such as semantic text search, document summarization, and named entity recognition.
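A concrete example of the semantic search case, using sentence-transformers and a ~110M-parameter checkpoint (my choice of library and model, just to show the scale that's already useful):

```python
# Minimal sketch of semantic text search with a small encoder model on CPU.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")   # ~110M params
docs = [
    "Invoice from ACME Corp for Q3 consulting services",
    "Minutes of the board meeting on the new product line",
    "GPU cluster maintenance schedule for the data center",
]
query = "billing documents"

doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=1)
print(docs[hits[0][0]["corpus_id"]])               # -> the invoice document
```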
> If you specifically mean a general LLM trained on a general language corpus with instruction finetuning this is correct.
Yes, thanks, that's what I meant.
> If you are training an LLM on a domain-specific corpus, or finetuning it on specific downstream tasks, even relatively tiny models at 330M params are definitely useful and not “toys”; they can accurately perform tasks such as semantic text search, document summarization, and named entity recognition.
> This creates a much smaller Transformer (4 layers, 4 heads, 64 embedding size), runs only on CPU, does not torch.compile the model (torch seems to give an error if you try), only evaluates for one iteration so you can see the training loop at work immediately, and also makes sure the context length is much smaller (e.g. 64 tokens), and the batch size is reduced to 8. On my MacBook Air (M1) this takes about 400ms per iteration. The network is still pretty expensive because the current vocabulary is hard-coded to be the GPT-2 BPE encodings of vocab_size=50257. So the embeddings table and the last layer are still massive. In the future I may modify the code to support simple character-level encoding, in which case this would fly. (The required changes would actually be pretty minimal, TODO)
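A quick back-of-the-envelope check of why the embedding table dominates that tiny config (standard GPT block formulas, biases and LayerNorms ignored):

```python
# Parameter count for the quoted debug config: 4 layers, n_embd=64, vocab_size=50257.
n_layer, n_embd, vocab = 4, 64, 50257

per_block = 12 * n_embd ** 2        # attention (4*d^2) + MLP (8*d^2)
blocks = n_layer * per_block        # ~0.2M params in the transformer blocks
embeddings = vocab * n_embd         # ~3.2M params in the token embedding / output head

print(f"blocks:     {blocks / 1e6:.2f}M")
print(f"embeddings: {embeddings / 1e6:.2f}M")  # dominates, as the quote says
```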
But how often do you need to run this? You can rent 8xA100 on LambdaLabs [0] (no affiliation) for $8.80/hr. So you should be able to train on the entire data set for less than $350.
If you can't fit the model on your resources, you can leverage DeepSpeed's ZeRO-Offload, which will let you train GPT-2 on a single V100 (32GB).
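For the ZeRO-Offload route, the gist is a config dict along these lines (keys follow DeepSpeed's documented ZeRO options; the model/optimizer setup is omitted, so this is only a sketch):

```python
# Sketch of enabling ZeRO stage 2 with optimizer-state offload to CPU RAM.
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                              # partition optimizer state + gradients
        "offload_optimizer": {"device": "cpu"},  # keep optimizer state in CPU memory
    },
}

# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```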
Alternatively, if you're doing research (with the caveat that you have to either publish, open-source, or share your results in a blog post), you can also get access to Google's TPU Research Cloud, which gives you a few v3-8s for 30 days (you can't do distributed training across devices, but you can run workloads in parallel). You can also ask nicely for a pod; I've been granted access to a v3-32 for 14 days pretty trivially, which (if optimized) has more throughput than 8xA100 on transformer models.
TPUs, and even more so pods, are a bit harder to work with, and TF performs far better than PyTorch on them.
I was curious about how much this would cost to rent, because the cost of those servers is definitely outside my budget! Lambda has 8xA100 40GB for $8.80/hr: https://lambdalabs.com/service/gpu-cloud#pricing
It seems about as likely as people being able to build automaker-grade cars with just the tools in their garage. More compute is going to keep producing better results, at least for LLMs.
Most decently large colleges have been investing in HPC for a while, and started investing in GPU HPC around 2014. You'd be surprised what sort of school projects the compute budget exists for.
I went to a smallish state university; even there we had our own HPC center and lab. We had a proper (IIRC) 6-row HPC data center across campus, and as an undergraduate research assistant I had a continuous budget for building Beowulf clusters for the graduate programs to run assignments on. I once got an allowance to buy 15 Raspberry Pis to build an ARM cluster.
That's to train it from scratch, though, right? If you preload the GPT-2 weights, you don't need to do this. You can just give it additional training on your texts.
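One way to do that (not the repo's own script; just the Hugging Face transformers route as an illustration, with a placeholder text file and hyperparameters):

```python
# Sketch of continuing training from the released GPT-2 weights on your own texts.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling,
                          TextDataset)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")       # preloaded weights

train_data = TextDataset(tokenizer=tokenizer, file_path="my_texts.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    data_collator=collator,
    train_dataset=train_data,
)
trainer.train()
```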
Well, he does include instructions for running it on a personal computer, which looks like what I'm gonna be doing next week.
Besides the rental options discussed below, these NVIDIA boxen don't look too big, so either used ones will be available for cheap relatively soon, or you could just locate and liberate one in Promethean fashion.
Supposedly, even running the trained model behind ChatGPT is extremely expensive, unlike the image generators, which can largely be run on a consumer device.
So if I see it right, that would be a p4d.24xlarge instance, which goes for about $32.77 an hour nowadays, so the total training would be about $1,245. Not cheap, but certainly not a nation-state budget.
Edit: I just noticed Lambda Labs. It seems they ask $8.80 per hour for an instance of this caliber. That puts the total training cost around $334. I wonder why it is that much cheaper.
That is a key difference. You can't easily and cheaply rent an auto factory, but you're starting to be able to rent an LLM training "factory" once per model, after which you can run inference on that model much more cheaply.
Search on the web is becoming problematic. Quality is decreasing and competition is extremely hard.
Some time ago I came to the realisation that the biggest strength of Google might also be its Achilles heel. Google is forced to return a list of links because that's the main vehicle they drive profit from. If you were to send a question like "Who is Barack Obama?" you would still get a list of links, although Google knows there is a canonical answer.
However, if you were to build a new search engine from the ground up, you would need to build the infrastructure to crawl the web, index it, and build the interface. That takes a lot of money and time just to test one idea. And there are multiple possible angles of attack on Google's business model (privacy, subscription model, modality, etc.). You might get the chance to test one of them, and if that fails, starting again is so expensive that you might not be able to.
My idea is to have a single open web index database, continuously updated, so that you can apply ranking and embedding algorithms to it. This would reduce the cost of entry and enable developers to build competitors to Google on top of it, or create new products in the search space (for instance, a search engine for clothes). I don't know if this is interesting to anyone, but if it is, hit me up.
I'm working on building an AWS for anyone who wants to make their own search engine. The idea is to have a single open web index database, continuously updated, to which you can apply ranking and embedding algorithms. This would reduce the cost of entry and enable developers to build competitors to Google on top of it, or create new products in the search space, like a search engine for clothes. I don't know if this is interesting to anyone, but if it is, hit me up.
That sounds very cool, and I hope you (and your customers!) are successful. Out of curiosity, did you find an existing market need for that, or is it a "build it and they will come" model?
Also, have you thought about partnering with commoncrawl.org? I could see that relationship benefiting both sides: they get fresher indices, you get access to the historical web snapshots.
I've faced this problem. I think one of the main issues with Google is the modality of the results. Google is forced to return a list of links because that's the main vehicle they drive profit from. If you were to send a question like "Who is Barack Obama?" you would still get a list of links, although Google knows there is a canonical answer.
The problem is that if you were to build a new search engine from the ground up, it would take millions in infrastructure and a lot of time just to test one idea. And there are multiple angles of attack on Google's business model (privacy, subscription model, modality, etc.), but you might only get the chance to test one of them, and if that fails, starting again is so expensive that you might not be able to get funds to do it.
My approach then became to build something that others can build on top of.
I'm currently using Common Crawl, but my main problem is that I need to build a small toy to test it, and even processing Common Crawl is crazy expensive. A single snapshot is 150 TB, so this needs to be processed on metal, or you're going to pay a hefty AWS bill.
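For scale, even just streaming the data is non-trivial; a sketch of reading one WARC file with warcio (the URL is a placeholder, real paths come from each crawl's warc.paths.gz listing):

```python
# Sketch of streaming a single Common Crawl WARC file; one crawl has tens of
# thousands of these, which is where the ~150 TB per snapshot comes from.
import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder path, not a real file.
warc_url = "https://data.commoncrawl.org/crawl-data/<CRAWL-ID>/segments/<SEGMENT>/warc/<FILE>.warc.gz"

resp = requests.get(warc_url, stream=True)
for record in ArchiveIterator(resp.raw):
    if record.rec_type == "response":
        url = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read()
        # ...extract/index here
```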
> If you were to send a question like "Who is Barack Obama?" you would still get a list of links, although Google knows there is a canonical answer.
For that specific search I would start at Wikipedia, but for more general "data search" I lean towards Wolfram Alpha, which has some usability issues but an interesting maths engine for queries.
https://www.wolframalpha.com/input?i=Barack+Obama+vs+Donald+...
It sounds like just what we need to break free from Google.
I’ve been dreaming of an open web index and social graph for more than a decade.
Any company having the data + the algorithm + the presentation layer is way too much power. We can and should split that problem into its separate domains.
is what I saw as the primary difference. Whether that's going to pan out in reality as well as it does in HN comments is a case of "the devil's in the details", though.
My professional background is in artificial intelligence. Currently I'm the Tech Lead at Bond Touch, and I've worked at Unbabel, a YC company, as an artificial intelligence engineer on the machine learning team. I also have some interests in quantum computing, financial modeling, and robotics.
I love watching SpaceTime on YouTube! It gives you a surprisingly deep understanding of the state of the art in physics, but it's the kind of show where, if you don't have a massive background in physics, you either need to be extremely focused to understand it, or blazed out of your mind.