nighthawk454 16 hours ago
Some potentially related stuff on the topic:
Anisotropy in word embeddings dates back to at least 2017 with word2vec - where there were zero layers.
The cone-shaped anisotropy in transformers has been known since at least Gao et al. 2019. That lineage explained it fairly intuitively as an artifact of word frequency and softmax geometry (so a training dynamic).
A variety of papers followed up by adding post-hoc ‘whitening’ steps (from classical statistics/NLP), then adding regularizers to the loss to penalize the anisotropy, eventually penalizing the covariance matrix (a la VICReg), and then the SIGReg method as a computationally much cheaper way to approximate the full covariance.
As another commenter pointed out, it’s also similar to the InfoNCE/contrastive learning objectives, where terms were added to increase uniformity (spreading points out evenly) on the hypersphere, as in the SimCSE paper (Gao 2021) or the excellent alignment/uniformity breakdown from Wang & Isola 2020.
This proposed dispersion loss seems similar in that it pushes things apart by penalizing cosine similarity, although it works on the tokens within one sequence. Usually, contrastive methods mean-pool each sequence and then contrast it against the other pooled sequences in the batch.
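To make the comparison concrete, here is a rough sketch of the two kinds of penalty in plain Python. The first function is only an assumption about what "penalizing cosine similarity within one sequence" could look like (mean pairwise cosine similarity over a sequence's token vectors); it is not taken from the paper being discussed. The second follows the uniformity term from Wang & Isola 2020 (log of the mean Gaussian potential over pairs of normalized points, with t=2 as the usual default). The function names and toy vectors are made up for illustration.

```python
import math


def _normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_dispersion_penalty(tokens):
    """Mean cosine similarity over all distinct token pairs in ONE sequence.

    Hypothetical form of a within-sequence dispersion term: lower values
    mean the token vectors point in more different directions.
    """
    unit = [_normalize(t) for t in tokens]
    n = len(unit)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += sum(a * b for a, b in zip(unit[i], unit[j]))
    return total / (n * (n - 1) / 2)


def uniformity_loss(points, t=2.0):
    """Uniformity term in the style of Wang & Isola 2020.

    log of the mean of exp(-t * ||x - y||^2) over distinct pairs of
    unit-normalized points; lower means more evenly spread on the sphere.
    In contrastive setups the points would be pooled sequence embeddings.
    """
    unit = [_normalize(p) for p in points]
    n = len(unit)
    potentials = []
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(unit[i], unit[j]))
            potentials.append(math.exp(-t * sq_dist))
    return math.log(sum(potentials) / len(potentials))


# Toy 2-D vectors: a narrow "cone" versus three evenly spread directions.
cone = [[1.0, 0.0], [0.99, 0.1], [0.98, -0.1]]
spread = [[1.0, 0.0], [-0.5, 3 ** 0.5 / 2], [-0.5, -(3 ** 0.5) / 2]]

print(cosine_dispersion_penalty(cone), cosine_dispersion_penalty(spread))
print(uniformity_loss(cone), uniformity_loss(spread))
```

Both scores are lower for the spread-out set than for the cone; the practical difference is only which vectors get compared (tokens inside one sequence versus pooled sequences across a batch).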
estebarb 18 hours ago
That sounds very similar to what we know in self-supervised learning as representation collapse. I wonder if we could copy some of the anti-collapse mechanisms from SSL into GPT... after all, they are ways to increase the differential entropy. However, I'm not sure it would be useful after all: a pure function cannot produce more entropy than it receives... and natural language as text has much less entropy than other domains... [edit: typos]
aetherspawn 21 hours ago
It makes sense to me that distributing across more parameters results in models that can be quantized more heavily (information theory - more bits available).
I wonder if anyone has figured out how the information is compressed and calculated the amount of information an LLM can hold depending on its size
woadwarrior01 21 hours ago
> I wonder if anyone has figured out how the information is compressed and calculated the amount of information an LLM can hold depending on its size
You might want to look at Physics of Language Models[1]. IIRC, the authors estimate it to be ~2 bits of factual knowledge per parameter.
[1]: https://physics.allen-zhu.com/
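As back-of-the-envelope arithmetic, the ~2 bits per parameter figure quoted above (recalled from memory, so treat it as approximate) can be turned into a rough capacity number. This is only a sketch: the function names are made up, and the 7B model size and 16-bit weight width are illustrative assumptions, not numbers from the thread.

```python
BITS_PER_PARAM = 2.0  # approximate figure quoted in the comment above


def knowledge_capacity_bytes(num_params, bits_per_param=BITS_PER_PARAM):
    """Estimated factual-knowledge capacity in bytes."""
    return num_params * bits_per_param / 8


def weight_storage_bytes(num_params, bits_per_weight):
    """Raw storage taken by the weights at a given precision."""
    return num_params * bits_per_weight / 8


params = 7_000_000_000  # hypothetical 7B-parameter model
capacity = knowledge_capacity_bytes(params)
storage = weight_storage_bytes(params, bits_per_weight=16)

print(f"estimated knowledge: {capacity / 1e9:.2f} GB")
print(f"16-bit weight storage: {storage / 1e9:.2f} GB")
print(f"fraction of storage used: {capacity / storage:.3f}")
```

Under these assumptions a 7B model would hold about 1.75 GB of factual knowledge in 14 GB of 16-bit weights, which fits the intuition that there is room to quantize.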
Big topic early 2020s