What inference performance are you getting on this with llama.cpp?

How long would it take to recoup the cost if you made the model available for others to run inference at the same price as the big players?


He has GLM 4.5 running at ~100 tokens per second.

Assumptions:

Batch 4x to get 400 tokens per second, pushing power consumption to 900W instead of the underutilized 300W.

Electricity at around €0.20/kWh.

Output tokens valued at €1 per 1M.

Assume ~70% utilization.

Result:

You get ~1M tokens per hour, which is a net profit of ~€0.8/hr. That puts the payoff time at a bit over a year on the €9K investment.
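
Here's that arithmetic as a quick Python sketch (every input is one of the assumptions above, not a measurement):

    # Back-of-envelope payback estimate; all inputs are assumptions, not measurements.
    tokens_per_sec = 400            # assumed 4x batching over the ~100 tok/s observed
    utilization = 0.70              # fraction of capacity actually sold
    power_kw = 0.9                  # assumed draw under batched load
    electricity_eur_per_kwh = 0.20
    price_eur_per_mtok = 1.0        # assumed rate per 1M output tokens
    hardware_cost_eur = 9000

    tokens_per_hour = tokens_per_sec * 3600 * utilization           # ~1.0M
    revenue_per_hour = tokens_per_hour / 1e6 * price_eur_per_mtok   # ~€1.00
    power_cost_per_hour = power_kw * electricity_eur_per_kwh        # ~€0.18
    net_per_hour = revenue_per_hour - power_cost_per_hour           # ~€0.83

    print(f"payback: {hardware_cost_eur / net_per_hour / 24:.0f} days")
    # -> ~453 days, i.e. "a bit over a year"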

Honestly, though, there is a lot of handwaving here. The most significant unknown is whether you can sustain high utilization with aggressive batching under 24/7 load.

Also, demand for privacy can make the tokens worth much more than typical API prices for open-source models.

On a somewhat orthogonal note: renting 2 H100s costs around $6 per hour, so if the rig replaces that rental, the payback time drops to a bit over a couple of months.


> He has GLM 4.5 running at ~100 tokens per second.

GLM 4.5 Air, to be precise. It's a smaller 106B model, not the full 355B one.

Worth mentioning when discussing token throughput.


I'm downloading DeepSeek-V3.2-Speciale now at FP8 (reportedly Gold-medal performance in the 2025 International Mathematical Olympiad and International Olympiad in Informatics).

It will fit in system RAM, and since it's a mixture-of-experts model and the experts are not too large, I can at least run it. Tokens per second will be slower, but with system memory bandwidth somewhere around 500-600 GB/s, it should feel OK.
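
A rough way to sanity-check that: on a memory-bandwidth-bound setup, decode speed is at best the bandwidth divided by the bytes of active weights read per token. A minimal sketch, assuming FP8 and ~37B active parameters (a placeholder; check the model card for the real figure):

    # Crude upper bound on decode speed for a memory-bound MoE model.
    bandwidth_gb_s = 550      # assumed ~500-600 GB/s system memory bandwidth
    active_params_b = 37      # billions of params active per token (assumption)
    bytes_per_param = 1       # FP8

    gb_read_per_token = active_params_b * bytes_per_param
    print(f"~{bandwidth_gb_s / gb_read_per_token:.0f} tokens/sec, best case")
    # -> ~15 tok/s; real throughput will be lower due to overheads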


Check out "--n-cpu-moe" in llama.cpp if you're not familiar with it. It lets you force a certain number of MoE expert layers to stay in system memory while everything else (including the context cache and the parts of the model that every token touches) is kept in VRAM. You can do something like "-c128k -ngl 99 --n-cpu-moe <tuned_amt>", where you find a number that maximizes VRAM usage without OOMing.
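
The tuning itself is just a budget calculation. A toy sketch of how you might pick a starting value, with all sizes as hypothetical placeholders (measure your actual model, quant, and context cache):

    # Toy calculator for a starting --n-cpu-moe value: the smallest number
    # of MoE layers kept in system RAM such that the rest fits in VRAM.
    # All sizes below are hypothetical placeholders.
    vram_gb = 24.0
    dense_and_kv_gb = 8.0        # attention weights + context cache in VRAM
    n_moe_layers = 46            # MoE layers in the model (placeholder)
    expert_gb_per_layer = 1.2    # expert weights per layer (placeholder)

    for n_cpu_moe in range(n_moe_layers + 1):
        gpu_moe_gb = (n_moe_layers - n_cpu_moe) * expert_gb_per_layer
        if dense_and_kv_gb + gpu_moe_gb <= vram_gb:
            print(f"try --n-cpu-moe {n_cpu_moe}")   # then nudge down until OOM
            break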


This is about more than cost. I can run 600B+ models at home. Today my wife and I asked ChatGPT a quick question, and it refused because it can't generate results based on race. I tried to prompt around the refusal and it absolutely would not budge. I asked my local model, the latest Mistral-Large3-675B, and got the answer I was looking for. What's the cost of that?


About the cost of your hardware, lol.


The author was running a quantised version of GLM 4.5 _Air_, not the full-fat version. API pricing for that is closer to $0.2/$1.1 (input/output per 1M tokens) at the top end from z.ai themselves, and half that price from Novita/SiliconFlow.


Selling LLM inference directly might not be the most effective approach.

I think there are probably law firms and doctors' offices that would gladly pay ~€3-4K a month to have this thing delivered and run truly "on-prem" to work with documents they can't risk leaking (patent filings, patient records, etc.).

For a company with 20-30 people, the legal and privacy protection is worth the small premium over using cloud providers.

Just a hunch, though! That would have it paid off in 3-4 months?


https://mailpace.com is fully European based and independent


They are based in the UK. That is technically Europe, but I believe that for privacy regulations it isn't the same as an EU country, though I could be very wrong. Would love to be educated on this by someone.


The UK inherited the same GDPR from the EU, so in practice it remains the same.

MailPace data is also hosted in the EU only.


Currently at the millions stage with https://mailpace.com, relying mostly on Postgres.

Tbh this terrifies me! We don't just have to log the requests but also store the full emails for a few days, and they can be up to 50 MiB in total size.

But it will be exciting when we get there!


My favourite band (King Gizzard) removed all their music from Spotify. I took the opportunity to switch to Navidrome with Tailscale and started obtaining music via Bandcamp and ripping old CDs. It works much better than I expected, even transcoding from FLAC to MP3 on the fly for my phone app.

Putting the monthly Spotify fee into my own music collection instead has been a great investment, and it means I actually listen to the music rather than just playing the same songs off a Spotify playlist every now and then.


For the dead comment asking whether this is a VS Code fork: it's not. It's a completely new, custom word processor written in Rust from the ground up.


Try https://mailpace.com

The lowest plan, $40/year for 1k emails/month, isn't on the Pricing page, but you can select it when signing up.


Sounds expensive. Amazon SES includes 1k emails/month for free (if you send via the API). That quota doesn't apply when sending via SMTP, but even then 1k emails costs just $0.10 (yes, ten cents). I don't use any AWS service other than SES, which I keep for email because of the pricing; I host everything else on Hetzner.


Yes, but AWS SES emails don't get delivered to inboxes.


That doesn't seem anywhere close to the truth; otherwise Amazon SES would have no business. I use it myself in my web app to deliver signup verification emails and haven't gotten a single complaint so far.


Thanks for recommending MailPace. £7.50/month for 10,000 emails is very reasonable, _and_ they support idempotency! Definitely makes me consider switching to them.
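
For anyone curious what that looks like in practice, here's a minimal Python sketch of an idempotent send. The endpoint and auth header are from MailPace's public docs; the "Idempotency-Key" header name is my assumption, so check the current docs before relying on it:

    # Minimal sketch of an idempotent send via the MailPace HTTP API.
    # NOTE: the idempotency header name is an assumption; verify in the docs.
    import uuid
    import requests

    resp = requests.post(
        "https://app.mailpace.com/api/v1/send",
        headers={
            "MailPace-Server-Token": "YOUR_API_TOKEN",  # placeholder
            "Idempotency-Key": str(uuid.uuid4()),       # reuse the same key on retries
        },
        json={
            "from": "you@example.com",
            "to": "user@example.com",
            "subject": "Hello",
            "textbody": "Retries with the same key should only send once.",
        },
    )
    resp.raise_for_status()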


Been using MailPace for a few years.

It has been a 10/10 experience: rock solid and extremely good deliverability.

I do wish the pricing scaled sublinearly at higher volumes, though.


Thanks. It's not very smart not to list that plan on the Pricing page, IMO.


Or Migadu for 19/yr.


Migadu is more for personal email; they aren't meant for transactional emails at all.


Here's an example: https://contextsync.dev/


This is what you’re looking for: https://tritium.legal/


I saw this on HN before, but how is it for litigation?


Another source backing up the first claim: https://carnegieuk.org/blog/online-safety-and-carnegie-uk/

I would like to see much more thorough journalism on the origin of these laws.


3-year-old M1 MacBook Pro (32 GB): 42 tokens/sec in LM Studio.

Very much usable.

