Simply put, it's when your output embedding matrix is the same matrix as your input embedding.
You save vocab_dim * model_dim parameters (for GPT-3 that's roughly 50,257 × 12,288 ≈ 617M).
But through the residual stream there's a direct path from input embedding to output logits that's roughly a single matmul, W_E W_E^T, and since that matrix is symmetric it struggles to encode bigrams (the score for "A then B" is forced to equal "B then A").
Attention + MLPs add nonlinearity on top of that path, but tying still costs some expressivity.
Which is why tied embeddings aren't SOTA, but they're useful in smaller models, where the embedding matrix is a much larger fraction of the total parameters.
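To make that concrete, here's a minimal PyTorch sketch of weight tying (the module name TinyLM and the toy sizes are illustrative, not anything from this thread); the check at the end shows why the direct path W_E W_E^T can't tell "A then B" apart from "B then A":

    import torch
    import torch.nn as nn

    class TinyLM(nn.Module):
        def __init__(self, vocab_size=1000, d_model=64):  # GPT-3 scale would be ~50257 x 12288
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)             # input embedding W_E
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)  # output projection
            self.lm_head.weight = self.embed.weight                    # weight tying: output matrix = input matrix

        def forward(self, tokens):
            h = self.embed(tokens)   # (transformer blocks omitted for brevity)
            return self.lm_head(h)   # logits = h @ W_E^T

    model = TinyLM()
    # Savings: one vocab_size x d_model matrix instead of two.
    # At GPT-3 scale that's 50257 * 12288 ≈ 617.6M parameters.

    # Direct path through the residual stream (embed -> unembed, skipping
    # attention and MLPs): logits are W_E @ W_E^T, which is symmetric, so the
    # score for "token i followed by token j" equals "token j followed by token i".
    W = model.embed.weight
    direct = W @ W.T
    assert torch.allclose(direct, direct.T)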
There isn't a specific place, it's the general aesthetic. Maybe you do sound like an LLM :P I guess it's not unlikely to pick up some mannerisms from them when everyone is using them.
I guess I don't really mind whether an LLM was used or not; it's more that the style sounds very samey with everything else. Whether it's an LLM or not isn't very relevant, I guess.
Try Cursor Composer! It's the most natural transition. Exactly what you're currently doing, but it inserts the code snippets for you from within your IDE.