OLMo uses open datasets such as CommonCrawl and StackOverflow for training, about 5 TB worth of text. I wonder how well it would perform if it were also trained on Anna's Archive/LibGen (>600 TB).
A possibly better question is how well it would perform if it were trained on carefully selected material; see the efforts of Mortimer Adler in the USA, or the efforts of any good publishing house in defining its editorial collections.
But I remain skeptical that, without "critical thinking as a condition for writing into 'conscious' memory", the barrier of "conformism" will ever be broken.
Not a lawyer, but I would assume downloading material from LibGen is, in the vast majority of cases, illegal because it's a breach of copyright or similar. That's gotten Meta into quite a spectacle of late [1].
CommonCrawl is composed of copyrighted content too. You gain copyright on your work automatically the moment you create it, including this very comment.
One could argue that using copyrighted content in LLMs, much like reposting, should fall under fair use. This is also Microsoft's claim in the GitHub Copilot lawsuits. It's up to the courts to decide, though. (IANAL)
It’s a catchy term, but loaded. Copyright protects only original expression, not ideas and information. So if a computer algorithm reads the former and outputs the latter, arguably copyright isn’t involved at all.
There are plenty of good counterarguments to this as well, when you consider the effects of automation and scale. I’m definitely interested in seeing how the jurisprudence develops as these cases go through the courts.
I have struggled with SVG generation with just about all models; the SVG demo for this model is more or less what I get from much larger models.
Am I doing something wrong? Everyone seems to say how well models work at producing SVGs, but I get shapes in all sorts of wrong places. SVG documents are quite low level (versus editing them in Inkscape or Illustrator), so they're tricky to modify beyond very simple shapes.
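To make the "low level" point concrete: even a trivial scene is nothing but absolute coordinates that all have to stay mutually consistent. A minimal sketch (the shapes and numbers below are made up, not output from any model), written as a Python string just so it can be saved and opened in a browser:

```python
# Illustrative only: a hand-written "bicycle" as raw SVG, embedded in a
# Python string. Every shape is placed by absolute coordinates, and all of
# them have to agree with each other for the drawing to make sense.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="130">
  <circle cx="50" cy="95" r="28" fill="none" stroke="black"/>
  <circle cx="150" cy="95" r="28" fill="none" stroke="black"/>
  <path d="M50 95 L95 60 L150 95 M95 60 L120 55" fill="none" stroke="black"/>
</svg>"""

with open("bike.svg", "w") as f:
    f.write(svg)
```

Shift one of those numbers and a wheel detaches from the frame, which is exactly the kind of spatial bookkeeping the models seem to get wrong.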
The models are mostly terrible at SVG output, at least if you ask for something that's hard (or impossible?) to draw like a pelican riding a bicycle. That's why I use it as a benchmark, I think it's amusing: https://simonwillison.net/tags/pelican-riding-a-bicycle/
Some of them can do good SVGs for things that make sense, like simple diagrams.
This works well for some SVGs that are simple and already in the training data, but it doesn't work for harder SVGs, or even simple ones, if they are out of the training data's distribution.
In Simon's example, the whole purpose is to make it draw something it has not seen before but could easily infer from geometry and spatial arrangement. I think it makes a fun problem.
I think it’s a big deal to see a fully open LLM now achieving this level of quality. While the partially open releases we’ve seen from the big labs are quite valuable, models like OLMo-2 are the only way that researchers can truly study this technology to answer questions about how the models’ capabilities are shaped by their training data and training process.
The closed and partly-closed models rely on a lot of secret sauce, so it’s also just really impressive to see their results being replicated in the open.
One of the paramount tasks is to understand the internals of the "black box", gain knowledge, and engineer better. Of course, having "fully open" projects should help with that.
I was wondering whether there was a specific reason MLX was involved with this model, but (thankfully, in the spirit of openness) it has nothing to do with the original model.
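For anyone curious, the MLX angle is just the generic mlx-lm workflow. A minimal sketch, assuming a converted OLMo-2 checkpoint is available; the repository name and prompt below are placeholders, not something stated in the post:

```python
# Minimal sketch of running a converted OLMo-2 checkpoint with mlx-lm.
# The repository name is an assumption (a hypothetical mlx-community
# conversion); substitute whichever converted checkpoint you actually use.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/OLMo-2-1124-7B-Instruct-4bit")

prompt = "Generate an SVG of a pelican riding a bicycle."
# For instruct-tuned checkpoints you would normally wrap the prompt with the
# tokenizer's chat template before generating.
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```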