The 'point' skill is trained on a ton of UI data; we've heard of a lot of people using it in combination with a bigger driver model for UI automation. We are also planning on post-training it to work end-to-end for this in an agentic setting before the final release -- this was one of the main reasons we increased the model's context length.
Re: chart understanding, there are a lot of different types of charts out there but it does fairly well! We posted benchmarks for ChartQA in the blog but it's on par with GPT5* and slightly better than Gemini 2.5 Flash.
* To be fair to GPT5, it's going to work well on many more types of charts/graphs than Moondream. To be fair to Moondream, GPT5 isn't really well suited to deploy in a lot of vision AI applications due to cost/latency.
Cool project! The codebase is simple and well documented, a good starting point for anyone interested in how to implement a high-performance inference engine. The prefix sharing is very relevant for anyone running batch inference to generate RL rollouts.
Yes, they even do at $1/GPU/hr. However, an 8xH100 cluster at full utilization draws ~8 kW of electricity and costs almost ~$0.5M; a 16xH100 cluster is probably 2x that. How many years before you break even at ~$24/GPU/day of income?
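For concreteness, here's a rough break-even sketch using the ballpark figures above (16 GPUs, ~$1M capex, ~1 kW per GPU); the electricity price of $0.10/kWh is my own assumption, not from the comment.

```python
# Break-even sketch for a 16xH100 cluster rented out at $1/GPU/hr.
# Capex, power draw, and income are the ballpark figures from the
# comment; the electricity price is an assumed $0.10/kWh.
gpus = 16
capex = 1_000_000            # ~2x the ~$0.5M quoted for 8xH100
income_per_gpu_day = 24      # $1/GPU/hr * 24 hr
power_kw = 16                # ~1 kW per GPU at full utilization
electricity_per_kwh = 0.10   # assumption

daily_income = gpus * income_per_gpu_day                 # $384/day
daily_power_cost = power_kw * 24 * electricity_per_kwh   # ~$38/day
daily_net = daily_income - daily_power_cost
years_to_break_even = capex / (daily_net * 365)
print(round(years_to_break_even, 1))  # roughly 7.9 years
```

And that's before cooling, networking, real estate, and depreciation, all of which push the break-even point further out.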
You should care about counterparty risk. If your business model depends on unsustainable third-party prices powered by VC largesse and unrealizable dreams of dominance, the very least you can do is plan for the impending reckoning, after which GPU prices will be determined by costs.
Look, I understand that some people are short-sighted and can hardly think outside the box, and that's totally fine by me. I don't judge you for it, so I kindly ask you not to judge my question. Learn to give some benefit of the doubt.
That's the point of MoE: sacrificing VRAM to save compute and RAM bandwidth, which makes it a harder sell for consumer devices but an easier one for server deployments, where workloads are more likely to be compute- or bandwidth-bound.
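A toy illustration of that tradeoff: every expert has to be resident in memory, but only the top-k routed experts are actually read per token, so memory footprint scales with total experts while bandwidth scales with active ones. All numbers below are made up for illustration, not taken from any specific model.

```python
# Toy MoE memory-vs-bandwidth tradeoff. All experts must sit in
# (V)RAM, but only the routed ones are read per token. Sizes are
# hypothetical, purely for illustration.
n_experts = 8
active_experts = 2             # top-2 routing
expert_params = 1_000_000_000  # 1B params per expert (made up)
bytes_per_param = 2            # fp16/bf16

memory_resident = n_experts * expert_params * bytes_per_param
bandwidth_per_token = active_experts * expert_params * bytes_per_param

print(memory_resident // 2**30, "GiB must be resident")
print(bandwidth_per_token // 2**30, "GiB read per token")
```

So relative to a dense model with the same per-token compute, this hypothetical MoE needs 4x the memory capacity but no extra bandwidth, which is cheap on servers and expensive on consumer hardware.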
The training technique used here (fitting something similar to a NeRF to different views of the same image) closely resembles this paper, which applies the same idea to denoise (instead of upscale) output features: https://arxiv.org/abs/2401.02957
Yeah, but in "Who is hiring?" threads the company is supposed to read your skills and respond if you are an actual fit, not just send it to every single person who posted. I'd bet dollars to donuts that half the candidates they sent it to don't even qualify for the position.
I think the assumption here is that the company claims they looked at your qualifications and decided it might be worth your effort to apply, when in fact they didn't.
If that's what's happening, it's a form of fraud (but legal, I imagine).
You're right. You are confused. He or she didn't get an email like that. It was an email that said, "Saw your profile on HN and we think your skills look like a good fit for our team." No one saw the profile and thought that. Re "wondered if you'd be interested in our YC company," no one wondered that.
This was a deceptive email meant only to advertise the fact that Anima 1) exists, and 2) is accepting applications. It should have been posted in the "Who is hiring?" thread.
Do outlier features emerge in sub-100M parameter models? I haven't seen any research discuss it below the 124M scale (bert-base). At that scale training a model takes ~4 days on an 8xA100 node.
That is a fair question, and in addition I'm unsure that a simple metric like perplexity is likely to pick it up.
However, I do think that if perplexity showed a smaller drop-off under quantization with this modified softmax, that would be an exciting finding and enough to justify further experiments.
But you are right - if it doesn't show an improvement, that doesn't necessarily rule out that it could be helping.
Edit: In the Qualcomm AI paper mentioned in this post, they experiment on BERT uncased (109M param) and OPT 125M and are able to show the effects using perplexity.
I hadn't read the paper when I suggested the same approach, so I guess that is good validation it is worth trying.
Edit2: Actually they also test on ViT 22M, which would be even quicker to try I think.