
It's more common to refer to embeddings as low-dimensional.


Can you give an example of that?

I've rarely seen embeddings with fewer than hundreds of dimensions.

UMAP and t-SNE are dimensionality reduction techniques that could maybe be considered embeddings, but I haven't encountered that usage in anything related to word2vec, LLMs, or much of the current AI fashion.


If your vocabulary size is 10000, then that's also your initial one-hot "dimension". Projecting down to 512 floats (or whatever) is, relatively speaking, low dimension.
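A minimal sketch of what that projection looks like (the numbers are just the ones from this thread; the matrix is random rather than learned):

    import numpy as np

    vocab_size, embed_dim = 10_000, 512

    # Stand-in for a learned embedding matrix; random here, just for shape.
    E = np.random.randn(vocab_size, embed_dim).astype(np.float32)

    token_id = 42
    one_hot = np.zeros(vocab_size, dtype=np.float32)  # 10,000-dimensional
    one_hot[token_id] = 1.0

    # "Projecting down": multiplying by the one-hot vector just selects
    # row 42 of E, a 512-dimensional vector.
    dense = one_hot @ E
    assert np.allclose(dense, E[token_id])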


Ah, I see. I understand, but I haven't encountered the terminology used that way. I tend not to think of one-hot encoding as a vector space, however, since the information is actually only log2(10000) bits per token. The embedding vector of 512 (or however many) floats, on the other hand, includes lots of information from positionally related tokens, so it carries quite a bit more than log2(vocab) bits.
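Back-of-the-envelope, in case it helps:

    import math
    print(math.log2(10_000))  # ~13.3 bits to identify one token out of 10,000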


> I tend to think of one-hot encoding as not a vector space

But it is a vector space. Technically the one-hot elements are the basis of the vector space, whose elements are weighted lists of words.

And it is the vector space with the highest useful dimension for embedding that vocabulary. Seriously, what are you going to do with the extra axes if you're representing a 10,000-word vocabulary using 1,000,000-element vectors? Those extra axes simply go to waste. In the one-hot case the only thing wasted is the magnitudes.
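A toy illustration of the "weighted lists of words" framing (a hypothetical three-word vocabulary, nothing library-specific):

    import numpy as np

    vocab = ["cat", "dog", "fish"]   # hypothetical 3-word vocabulary
    e = np.eye(len(vocab))           # one-hot basis vectors

    # An arbitrary element of the space: 0.5*"cat" + 2.0*"dog".
    # Addition and scalar multiplication keep you in the space.
    v = 0.5 * e[0] + 2.0 * e[1]
    print(v)  # [0.5 2.  0. ]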

This is why word2vec embeddings are considered lower-dimensional embeddings. They're lower-dimension than the most-naive-but-not-wasteful representation.


But when does anyone actually use a one-hot encoding as a vector space? They are used as a lookup table into a different representation, and the one-hot encoding is there to simplify the mathematical description, not to be used as a vector space. Or have you encountered systems where people actually use one-hot encodings?

While I would agree that, technically, in terms of raw dimensionality, a one-hot encoding has a larger dimension than the embedding, that would make the embedding "lower"-dimensional, not "low"-dimensional. Who uses "low" as an absolute term when referring to 300-2000 dimensions?


Thought of letting this one go, but then changed my mind.

> I tend to think

> I haven't encountered

Egocentric views of things are far less notable and far less interesting than what those things actually are. It makes for tiring reading. I hope this does not come off as too testy; my apologies if it does.


When we are talking about two people trying to establish common language, is it not important to talk about what one has seen personally? I didn't do it to boost my ego; very odd to think that it came across that way!


> I didn't do it to boost my ego

I don't doubt that for a minute, and neither did I intend such an interpretation.

Perhaps if I were to debate a well-known and well-understood concept in epigenetics in terms of how I personally think about it (with not even an iota of intent to boost my ego), my comment above might resonate better. For the purposes of such a hypothetical debate, how I personally think about that concept in epigenetics becomes more of a distraction. Does it not?

No offense intended and none taken. I do learn a lot from your comments, especially about the intricacies of biology.


It seems to be a problem with the certificate. The HTTP version works fine.
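If anyone wants to reproduce it, something like this shows the difference (assuming the requests library; the exact error text may vary):

    import requests

    url = "www.aaronsw.com/2002/feeds/pgessays.rss"

    print(requests.get("http://" + url).status_code)  # 200

    try:
        requests.get("https://" + url)
    except requests.exceptions.SSLError as e:
        print("certificate problem:", e)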

It's kind of funny how I noticed this: I consume the RSS feed for Paul Graham's essays that is set up there (https://www.aaronsw.com/2002/feeds/pgessays.rss). When I checked my RSS reader for problematic feeds, sure enough, there was an error for that feed.

Anyway, is anyone responsible for that site?


The cert expired in 2014.

