
Great post!

One quick note: the resulting embeddings for the same text string can differ depending on the input type you specify for retrieval tasks (i.e. query or document) -- check out the `input_type` parameter here: https://docs.voyageai.com/reference/embeddings-api.
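For example, a minimal sketch with the Python client (the exact call and model name here are my assumptions; the linked docs are authoritative):

    import voyageai

    vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

    text = ["How do I rotate an API key?"]

    # Same string, two input types -> two (slightly) different vectors.
    as_query = vo.embed(text, model="voyage-3", input_type="query").embeddings[0]
    as_doc = vo.embed(text, model="voyage-3", input_type="document").embeddings[0]

The rule of thumb is to embed your corpus with input_type="document" and your search strings with input_type="query".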


(I work at Voyage)

Many of the top-performing models you see on the MTEB retrieval leaderboards for English and Chinese tend to overfit to the benchmark nowadays. voyage-3 and voyage-3-lite are also pretty small compared to a lot of the 7B models that take the top spots, and we don't want to hurt performance on other real-world tasks just to do well on MTEB.


> we don't want to hurt performance on other real-world tasks just to do well on MTEB

Nice!

Fortunately MTEB lets you sort by model parameter size because using 7B parameter LLMs for embeddings is just... Yuck.


It would still be great to know how it compares?

Why should I pick voyage-3 if for all I know it sucks when it comes to retrieval accuracy (my personally most important metric)?


We provide retrieval metrics for a variety of datasets and languages: https://blog.voyageai.com/2024/09/18/voyage-3/. I also personally encourage folks to either test on their own data or find an open-source dataset that closely resembles the documents they are trying to search (we provide a ton of free tokens for evaluating our models).
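For what it's worth, the evaluation itself doesn't need to be fancy -- something like the recall@k sketch below (plain numpy; you supply your own queries, documents, and relevance labels) is usually enough to compare models on your data:

    import numpy as np

    def recall_at_k(query_vecs, doc_vecs, relevant_doc_ids, k=10):
        # Cosine similarity via normalized dot products.
        q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
        d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        topk = np.argsort(-(q @ d.T), axis=1)[:, :k]
        # Fraction of queries whose labeled relevant doc shows up in the top k.
        hits = [rel in row for rel, row in zip(relevant_doc_ids, topk)]
        return float(np.mean(hits))

Embed the same corpus and queries with each model you're considering and compare the numbers.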

Not my website, but I agree - it is fun indeed.


The author, aquajet, first posted it as a Show HN about a month ago, it seems.


Agreed. ViTs are better if you're looking to go multimodal or use attention-specific mechanisms such as cross-attention. If not, there's evidence out there that ViTs are no better than convnets, both for small networks and at scale (https://frankzliu.com/blog/vision-transformers-are-overrated).


ViTs have also proven to be more effective for zero-shot generalization tasks, due to their ability to capture global context and relationships in the input data, which CNNs struggle with.

https://arxiv.org/abs/2304.02643


A lot of folks I've spoken with say that single-agent systems are still extremely limited, let alone multi-agent platforms. In general, it seems to boil down to:

- Agents need lots of manual tuning and guardrails to make them useful

- Agents with too many guardrails are not general-purpose enough to be worth the time and effort to build

I believe truly great agents will only come from models whose weights are dynamically updated. I hope I'm wrong.


> I think a few details (and perhaps a small amount of customization) would go a long way.

I hear you and agree 100% - I unfortunately haven't gotten around to writing better documentation or solid code samples that utilize Radient yet.

Regarding molecule vectorization: that capability comes from RDKit (https://rdkit.org) - I just uploaded a sample to the /examples directory. You're right that molecule-to-audio and audio-to-molecule search is nonsensical from a semantic perspective, but I could see a third modality such as text or images that ties the two together, similar to what ImageBind is doing (https://arxiv.org/abs/2305.05665).
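If anyone's curious what that looks like, the standard RDKit recipe is roughly the following (illustrative only -- not necessarily how Radient wires it up internally; the SMILES string is just aspirin as an example):

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    # Parse a SMILES string into an RDKit molecule.
    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

    # Morgan (ECFP-like) fingerprint: a fixed-length bit vector you can index
    # and search just like a text or image embedding.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    vec = np.zeros((2048,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, vec)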


> You're double-dipping into the data. You look at the performance, then you tune some part of the network by hand, see if it helps, and then keep doing that. It's testing on the training data.

I purposely tried to avoid adding any niche network modifications that would help it overfit to in-1k. All three of the modifications are applicable to other networks and datasets.

> No one cares about ImageNet-1k. No one needs to classify ImageNet-1k in real life.

I completely agree with you; I just don't have the compute to train this on a massive dataset. With that being said, I'm not advocating for taking an in-1k model and putting it into production. I'm merely saying we can get ResNet to the same level of performance as ViTs, and that there's evidence convnets match that performance at scale as well.

> What you want is a ViT that's seen massive amounts of data so that your embeddings don't become degenerate because they're far away from ImageNet-1k!

https://arxiv.org/pdf/2310.16764.pdf


> purposely tried to avoid adding any niche network modifications that would help it overfit to in-1k. All three of the modifications are applicable to other networks and datasets.

You introduce a ridiculous number of degrees of freedom from your architectural modifications and the vast majority of them don’t seem to be some standard way of doing things. If these are generally applicable to most datasets, then show an eval on a new dataset against ViT that you haven’t looked at while making your arch modifications.

The way this is described in the blog is just not sufficient to support the sweeping titular claim; it needs a more robust evaluation that guards against overfitting/p-hacking.


I recommend checking out this paper: https://arxiv.org/pdf/2310.16764.pdf. From the concluding section:

"Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs outperform pre-trained ConvNets when evaluated fairly."

Neural networks can be unpredictable, and there's evidence questioning how important transformers' lack of inductive bias really is at scale.


The >10 fingers problem should be solved via higher quality data or conditioning rather than a better model architecture.


Are you suggesting AI generates images of people with more than 10 fingers because there are too many pictures of people with more than 10 fingers in the input data? That seems unlikely.


Higher quality could also refer to the descriptions attached to the original images in the dataset. If you were to describe a picture of someone, you would never specifically call out the fact that they have 5 fingers per hand; we take it for granted, so that kind of information may never appear in the dataset.

But I think what the grandparent means by conditioning is non-textual conditioning, like ControlNet. This will always be more powerful than trying to describe something by text. Think about the description of a character in a novel vs. the movie adaptation.
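For anyone who hasn't seen it, ControlNet-style conditioning looks roughly like this with diffusers (the checkpoints here are just common public examples, not an endorsement):

    import cv2
    import numpy as np
    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from PIL import Image

    # Extract a Canny edge map from a reference photo; this image, not text,
    # is what pins down the structure (pose, hand shape, composition).
    ref = np.array(Image.open("pose_reference.png").convert("RGB"))
    edges = cv2.Canny(ref, 100, 200)
    edges = Image.fromarray(np.stack([edges] * 3, axis=-1))

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

    # The prompt stays generic; the edge map carries the detail that text
    # would never spell out ("five fingers per hand", etc.).
    image = pipe("a portrait photo of a person", image=edges).images[0]
    image.save("conditioned.png")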


In some pictures not all 5 fingers on a hand are visible, so maybe it appears that humans have a variable number of fingers?


Humans with fewer than 5 fingers per hand do indeed exist. How does that lead to a default of 7?


I meant more that people are holding objects, or their hand is at an angle where not all fingers are visible in the picture.

I don't understand how neural networks operate, but my layman's guess is that when you sometimes see hands with 5 fingers visible, sometimes 4, sometimes 3, sometimes 2, sometimes 1, and sometimes 0, then it's not immediately apparent that every hand has between 0 and 5 fingers.

Think of it this way: if the AI has only ever seen houses with a maximum of 10 windows in its entire training set, is it so unthinkable that it sometimes draws a house with 12 windows? That's a "sensible" generalization, since houses do have a variable number of windows. It just doesn't work for fingers.

I'm sure the same issue would arise if humans had other body parts that came in large quantities, but almost everything else comes in ones or twos, like the nose or the eyes.


Counting fingers is how humans do it, not necessarily how AI does. On a five-fingered hand, fingers are more likely to have neighbors than not. Why wouldn't the default be infinite fingers?


People holding hands, people high-fiving, etc.


Yes, but if AI cannot solve the fingers problem, can it reliably generate images of things for which we have relatively few example images? We have an enormous number of images of hands.


It's exactly the reverse: the issue with generative AI isn't the data anymore, but models that aren't able to understand the data.


Outside of the work to study and improve available technologies ("how far can we push a hammer"), you do not draw an individual with anomalous hands, because you have an ontological model in which "humans normally have five fingers per hand".

Knowing "how the world works" is the appropriate basis for then expressing a representation.

Somewhere in the architecture, a world model should be formed.


This will be big for FPGAs - adders are extremely cheap compared to multipliers and other DSP blocks.


Multipliers for e.g. 8-bit or 4-bit floating point values should also be pretty cheap? (I assume multiplier cost grows quadratically with the number of bits?)


You use DSPs for that. Efinix has direct bfloat16 support in their FPGAs. The real game changer is using the carry chain with your LUT-based adders. Assuming 16 LUTs per adder, you could be getting 11 teraops out of a Ti180 using a few watts. Of course that's just a theoretical number, but I could imagine using four FPGAs for speech recognition, synthesis, and vision-based LLMs operating in real time.
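To make the back-of-the-envelope explicit (every number below is an assumption I'm plugging in for illustration, not a vendor spec):

    # Rough throughput estimate for LUT-based adders on an FPGA.
    total_luts = 172_000    # assumed usable LUTs / logic cells on the device
    luts_per_adder = 16     # assumed cost of one adder
    clock_hz = 1.0e9        # assumed (very optimistic) carry-chain clock

    adders = total_luts // luts_per_adder        # ~10,750 parallel adders
    adds_per_sec = adders * clock_hz             # one add per adder per cycle
    print(f"{adds_per_sec / 1e12:.1f} teraops")  # ~10.8 teraops, in that ballpark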

